Build on Priors: Vision-Language-Guided Neuro-Symbolic Imitation Learning for Data-Efficient Real-World Robot Manipulation

arXiv cs.RO / 4/7/2026


Key Points

  • The paper tackles data-efficient long-horizon robot manipulation by proposing an automated neuro-symbolic imitation learning pipeline that works from as few as one to thirty unannotated skill demonstrations.
  • It segments demonstrations into skills, then uses a vision-language model (VLM) to classify skills and discover equivalent high-level states, forming an automatically built state-transition graph.
  • An Answer Set Programming solver converts this graph into a synthesized PDDL planning domain, which is further used to isolate minimal, task-relevant observation/action spaces for each skill policy.
  • Unlike end-to-end raw actuator imitation, the method learns at a control-reference level to produce smoother targets and reduce noisy learning signals.
  • The approach is validated on a real industrial forklift with statistically rigorous trials and shows cross-platform generality on a Kinova Gen3 arm, highlighting scalability, expert-free setup, and interpretability.
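The graph-construction step in the pipeline above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual code: the state labels, skill names, and `build_transition_graph` helper are all hypothetical, standing in for the VLM's equivalence-class labels and segmented skills.

```python
from collections import defaultdict

def build_transition_graph(demos):
    """Build a state-transition graph from segmented demonstrations.

    Each demo is a list of (pre_state, skill, post_state) triples,
    where the states are high-level equivalence classes assigned by
    the VLM and the skill is its classification of the segment.
    """
    graph = defaultdict(set)
    for demo in demos:
        for pre, skill, post in demo:
            graph[pre].add((skill, post))  # edge: pre --skill--> post
    return dict(graph)

# One unannotated demo, already segmented and labeled (hypothetical labels):
demo = [
    ("pallet_on_floor", "approach", "forks_aligned"),
    ("forks_aligned", "lift", "pallet_raised"),
    ("pallet_raised", "transport", "pallet_at_goal"),
]
graph = build_transition_graph([demo])
```

Merging edges from many demonstrations into one graph is what lets a handful of trajectories cover more of the task's state space than any single demonstration does.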

Abstract

Enabling robots to learn long-horizon manipulation tasks from a handful of demonstrations remains a central challenge in robotics. Existing neuro-symbolic approaches often rely on hand-crafted symbolic abstractions, semantically labeled trajectories, or large demonstration datasets, limiting their scalability and real-world applicability. We present a scalable neuro-symbolic framework that autonomously constructs symbolic planning domains and data-efficient control policies from as few as one to thirty unannotated skill demonstrations, without requiring manual domain engineering. Our method segments demonstrations into skills and employs a Vision-Language Model (VLM) to classify skills and identify equivalent high-level states, enabling automatic construction of a state-transition graph. This graph is processed by an Answer Set Programming solver to synthesize a PDDL planning domain, which an oracle function exploits to isolate the minimal, task-relevant, target-relative observation and action spaces for each skill policy. Policies are learned at the control-reference level rather than at the raw actuator-signal level, yielding a smoother and less noisy learning target. Known controllers can be leveraged for real-world data augmentation by projecting a single demonstration onto other objects in the scene, simultaneously enriching the graph-construction process and the dataset for imitation learning. We validate our framework primarily on a real industrial forklift across statistically rigorous manipulation trials, and demonstrate cross-platform generality on a Kinova Gen3 robotic arm across two standard benchmarks. Our results show that grounding control learning, VLM-driven abstraction, and automated planning synthesis in a unified pipeline constitutes a practical path toward scalable, data-efficient, expert-free, and interpretable neuro-symbolic robotics.
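To make the graph-to-PDDL synthesis step concrete, here is a minimal sketch of the output shape: each graph edge (pre-state, skill, post-state) becomes a PDDL action whose precondition is the pre-state and whose effect swaps it for the post-state. This is an assumption-laden toy, not the paper's method; the actual pipeline uses an ASP solver to lift such grounded transitions into generalized predicates, and the `edge_to_pddl` helper and `at-state` predicate below are hypothetical.

```python
def edge_to_pddl(pre, skill, post):
    """Render one transition-graph edge as a grounded PDDL action.

    A real synthesis step would generalize states into lifted
    predicates; this only illustrates the action-schema structure.
    """
    return (
        f"(:action {skill}\n"
        f"  :precondition (at-state {pre})\n"
        f"  :effect (and (not (at-state {pre})) (at-state {post})))"
    )

# Hypothetical edge from the forklift task:
action = edge_to_pddl("forks_aligned", "lift", "pallet_raised")
```

Once every edge is rendered this way, an off-the-shelf PDDL planner can chain the learned skills into long-horizon plans, which is what makes the symbolic layer both interpretable and compositional.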