Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

arXiv cs.RO / 4/7/2026


Key Points

  • The paper addresses how to better ground pretrained vision-language model knowledge into actual robot behaviors, an open problem in embodied robotics control.
  • It proposes “Steerable Policies,” vision-language-action policies trained with rich synthetic commands spanning multiple abstraction levels (e.g., subtasks, motions, and pixel-level grounded coordinates) to improve low-level controllability.
  • The method aims to let VLM reasoning steer robot actions more directly than in standard hierarchical setups, which use natural-language task instructions as the sole interface between VLMs and low-level policies.
  • The authors test two high-level command sources—a learned embodied reasoner and an off-the-shelf VLM using in-context learning—to drive the steerable policies.
  • Results on extensive real-world manipulation experiments show improved generalization and long-horizon performance over prior embodied reasoning VLAs and VLM-based hierarchical baselines.
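To make the idea of commands at multiple abstraction levels concrete, here is a minimal sketch of what such a command interface might look like. The paper does not publish a schema; the class, field names, and example values below are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SteeringCommand:
    """Hypothetical command carrying one of several abstraction levels.

    A high-level reasoner can fill in whichever field matches the level of
    guidance it wants to give; the steerable policy conditions on all of them.
    """
    subtask: Optional[str] = None                    # e.g. "pick up the red mug"
    motion: Optional[str] = None                     # e.g. "move gripper left"
    pixel_target: Optional[Tuple[int, int]] = None   # grounded (u, v) image coordinate

    def level(self) -> str:
        # The most concrete field present determines the abstraction level.
        if self.pixel_target is not None:
            return "pixel"
        if self.motion is not None:
            return "motion"
        if self.subtask is not None:
            return "subtask"
        return "task"  # fall back to the plain language task instruction

cmd = SteeringCommand(subtask="pick up the mug", pixel_target=(212, 148))
print(cmd.level())  # "pixel"
```

The key design point this sketch illustrates is that richer, lower-level fields (like a pixel coordinate) give the VLM finer steering control than a language instruction alone.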

Abstract

Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks. Website: steerable-policies.github.io
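The hierarchical setup the abstract describes, a high-level reasoner (learned or an off-the-shelf VLM prompted via in-context learning) issuing commands that a steerable low-level policy executes, can be sketched as a simple control loop. The `reasoner.plan`, `policy.act`, and `env.step` interfaces below are assumptions, not the authors' API.

```python
def run_episode(reasoner, policy, env, task: str, max_steps: int = 200) -> int:
    """Hypothetical hierarchical control loop; returns the number of steps taken.

    `reasoner` stands in for the high-level VLM, `policy` for the steerable
    VLA, and `env` for the robot environment.
    """
    obs = env.reset()
    steps = 0
    for _ in range(max_steps):
        # High level: the reasoner inspects the observation and emits a
        # command at some abstraction level (subtask, motion, or pixel).
        command = reasoner.plan(obs, task)
        # Low level: the steerable policy conditions on both the task
        # instruction and the richer command, not on language alone.
        action = policy.act(obs, task, command)
        obs, done = env.step(action)
        steps += 1
        if done:
            break
    return steps
```

In this framing, swapping the learned embodied reasoner for a prompted off-the-shelf VLM only changes how `reasoner.plan` is implemented; the steerable policy below it is unchanged, which is what makes the comparison in the paper possible.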