Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision

arXiv cs.RO / 4/6/2026


Key Points

  • The paper proposes a neuro-symbolic method to specialize vision-language models (VLMs) so they generate interpretable, executable structured robot policies rather than opaque end-to-end visuomotor control.
  • It uses Behavior Tree policies as the structured representation, grounding decision-making in multimodal visual observations, natural-language instructions, and formal system specifications.
  • To avoid costly manual labeling, the authors introduce an automated synthetic supervision pipeline that creates domain-randomized multimodal scenes paired with instruction-to-policy examples generated by foundation models.
  • Experiments on two robotic manipulators reportedly show that policies learned entirely from synthetic supervision can transfer successfully to real physical robots.
  • Overall, the work argues that foundation models can be adapted to produce modular robot behavior policies that are better suited to safety-critical use, bridging high-dimensional learning and symbolic control.

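To make the "interpretable, executable structured policy" idea concrete, here is a minimal sketch of a Behavior Tree of the kind the paper uses as its policy representation. The node types (Sequence, Fallback, Condition, Action) are standard Behavior Tree constructs; the specific pick-and-place task, node names, and blackboard keys are illustrative assumptions, not taken from the paper.

```python
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

class Sequence:
    """Ticks children in order; stops and returns on the first non-SUCCESS child."""
    def __init__(self, children):
        self.children = children
    def tick(self, blackboard):
        for child in self.children:
            status = child.tick(blackboard)
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS

class Fallback:
    """Ticks children in order; stops and returns on the first non-FAILURE child."""
    def __init__(self, children):
        self.children = children
    def tick(self, blackboard):
        for child in self.children:
            status = child.tick(blackboard)
            if status != Status.FAILURE:
                return status
        return Status.FAILURE

class Condition:
    """Leaf node: checks a predicate against the shared blackboard state."""
    def __init__(self, name, predicate):
        self.name, self.predicate = name, predicate
    def tick(self, blackboard):
        return Status.SUCCESS if self.predicate(blackboard) else Status.FAILURE

class Action:
    """Leaf node: applies an effect to the blackboard (stands in for a robot skill)."""
    def __init__(self, name, effect):
        self.name, self.effect = name, effect
    def tick(self, blackboard):
        self.effect(blackboard)
        return Status.SUCCESS

# Hypothetical policy of the kind a VLM might synthesize from an instruction like
# "put the cube in the bin": grasp the cube unless already holding it, then place it.
policy = Sequence([
    Fallback([
        Condition("holding_cube", lambda bb: bb.get("holding") == "cube"),
        Action("grasp_cube", lambda bb: bb.update(holding="cube")),
    ]),
    Action("place_in_bin", lambda bb: bb.update(cube_in_bin=True, holding=None)),
])

world = {"holding": None}
policy.tick(world)
print(world)  # prints {'holding': None, 'cube_in_bin': True}
```

Because the policy is an explicit tree of named conditions and actions rather than network weights, each decision it makes can be inspected, and ticking the tree every control cycle gives the reactive execution the paper attributes to structured policies.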
Abstract

Vision-language models (VLMs) have recently demonstrated strong capabilities in mapping multimodal observations to robot behaviors. However, most current approaches rely on end-to-end visuomotor policies that remain opaque and difficult to analyze, limiting their use in safety-critical robotic applications. In contrast, classical robotic systems often rely on structured policy representations that provide interpretability, modularity, and reactive execution. This work investigates how foundation models can be specialized to generate structured robot policies grounded in multimodal perception, bridging high-dimensional learning and symbolic control. We propose a neuro-symbolic approach in which a VLM synthesizes executable Behavior Tree policies from visual observations, natural language instructions, and structured system specifications. To enable scalable supervision without manual annotation, we introduce an automated pipeline that generates a synthetic multimodal dataset of domain-randomized scenes paired with instruction-policy examples produced by a foundation model. Real-world experiments on two robotic manipulators show that structured policies learned entirely from synthetic supervision transfer successfully to physical systems. The results indicate that foundation models can be adapted to produce interpretable and structured robot policies, providing an alternative to opaque end-to-end approaches for multimodal robot decision making.
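The synthetic supervision pipeline described above can be sketched as a loop that domain-randomizes a scene, then pairs it with an instruction and a target structured policy. In the paper the instruction-policy examples are produced by a foundation model; the two stand-in functions below, the scene parameters, and the s-expression policy syntax are all illustrative assumptions.

```python
import random

def randomize_scene(rng):
    """Domain randomization: sample object identity, color, and pose (hypothetical ranges)."""
    return {
        "object": rng.choice(["cube", "cylinder", "sphere"]),
        "color": rng.choice(["red", "green", "blue"]),
        "x": round(rng.uniform(-0.3, 0.3), 2),
        "y": round(rng.uniform(0.2, 0.6), 2),
    }

def instruction_for(scene):
    """Stand-in for the foundation model that writes the natural-language instruction."""
    return f"pick up the {scene['color']} {scene['object']} and place it in the bin"

def policy_for(scene):
    """Stand-in for the foundation model that writes the target Behavior Tree policy."""
    obj = f"{scene['color']}_{scene['object']}"
    return f"(Sequence (Fallback (Holding {obj}) (Grasp {obj})) (PlaceInBin {obj}))"

def generate_dataset(n, seed=0):
    """Produce n (scene, instruction, policy) training examples with no manual labeling."""
    rng = random.Random(seed)
    return [
        {"scene": scene, "instruction": instruction_for(scene), "policy": policy_for(scene)}
        for scene in (randomize_scene(rng) for _ in range(n))
    ]

for example in generate_dataset(3):
    print(example["instruction"], "->", example["policy"])
```

The resulting triples would serve as supervised fine-tuning data for the VLM: rendered observations of the randomized scene plus the instruction as input, the structured policy as the target output. Randomizing scene parameters is what lets policies trained purely on synthetic data transfer to real robots.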