VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

arXiv cs.RO / 2026/3/24


Key Points

  • Vision-Language-Action (VLA) models often rely on a single “black-box” mapping from image+instruction to control signals, which can hurt spatial precision and robustness in out-of-distribution settings.
  • The proposed VP-VLA framework introduces a structured visual prompting interface that decouples high-level planning from low-level control.
  • A “System 2 Planner” breaks instructions into sub-tasks, identifies target objects and goal locations, and converts these into spatial anchors (e.g., crosshairs/bounding boxes) overlaid on the observations.
  • A “System 1 Controller,” trained with an auxiliary visual grounding objective, uses these prompts to generate more reliable and precise low-level execution motions.
  • Experiments on the Robocasa-GR1-Tabletop benchmark and SimplerEnv show success-rate gains of 5% and 8.3%, respectively, outperforming competitive baselines including QwenOFT and GR00T-N1.6.
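
The prompting interface described above can be sketched in code. The following is a minimal, hypothetical illustration (not the paper's implementation) of how a planner's outputs — a target point and a goal region — might be rendered as a crosshair and a bounding box directly onto an observation image; the function name, colors, and sizes are assumptions:

```python
import numpy as np

def overlay_visual_prompts(image, target_center, goal_box, cross_half=4):
    """Overlay hypothetical System 2 outputs onto an observation.

    image: HxWx3 uint8 array.
    target_center: (row, col) of a crosshair marking the target object.
    goal_box: (r0, c0, r1, c1) bounding box marking the goal location.
    All names and prompt colors here are illustrative assumptions.
    """
    out = image.copy()
    h, w, _ = out.shape
    r, c = target_center
    # Crosshair: short horizontal and vertical strokes through the target.
    out[r, max(0, c - cross_half):min(w, c + cross_half + 1)] = (255, 0, 0)
    out[max(0, r - cross_half):min(h, r + cross_half + 1), c] = (255, 0, 0)
    # Bounding box: draw the four edges of the goal region.
    r0, c0, r1, c1 = goal_box
    out[r0, c0:c1 + 1] = (0, 255, 0)
    out[r1, c0:c1 + 1] = (0, 255, 0)
    out[r0:r1 + 1, c0] = (0, 255, 0)
    out[r0:r1 + 1, c1] = (0, 255, 0)
    return out
```

The key design point is that the prompts live in pixel space, so the controller receives spatial grounding through the same visual channel as the raw observation rather than through extra text tokens.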

Abstract

Vision-Language-Action (VLA) models typically map visual observations and linguistic instructions directly to robotic control signals. This "black-box" mapping forces a single forward pass to simultaneously handle instruction interpretation, spatial grounding, and low-level control, often leading to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these limitations, we propose VP-VLA, a dual-system framework that decouples high-level reasoning and low-level execution via a structured visual prompting interface. Specifically, a "System 2 Planner" decomposes complex instructions into sub-tasks and identifies relevant target objects and goal locations. These spatial anchors are then overlaid directly onto visual observations as structured visual prompts, such as crosshairs and bounding boxes. Guided by these prompts and enhanced by a novel auxiliary visual grounding objective during training, a "System 1 Controller" reliably generates precise low-level execution motions. Experiments on the Robocasa-GR1-Tabletop benchmark and SimplerEnv simulation demonstrate that VP-VLA improves success rates by 5% and 8.3%, surpassing competitive baselines including QwenOFT and GR00T-N1.6.
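
The abstract mentions an auxiliary visual grounding objective added to the controller's training. As a rough sketch only — the paper's exact loss form, prediction head, and weighting are not given here, so everything below is an assumption — one common way to realize such an objective is to add a localization term (e.g., per-pixel binary cross-entropy against an anchor mask) to the action loss:

```python
import numpy as np

def vp_vla_style_loss(pred_actions, true_actions,
                      pred_anchor_logits, anchor_mask,
                      grounding_weight=0.1):
    """Hypothetical combined objective: MSE on continuous actions plus a
    per-pixel binary cross-entropy asking the controller to localize the
    prompted anchors. The loss form and weight are illustrative
    assumptions, not the paper's specification."""
    action_loss = np.mean((pred_actions - true_actions) ** 2)
    probs = 1.0 / (1.0 + np.exp(-pred_anchor_logits))  # sigmoid over logits
    eps = 1e-7  # numerical safety for the log terms
    grounding_loss = -np.mean(
        anchor_mask * np.log(probs + eps)
        + (1.0 - anchor_mask) * np.log(1.0 - probs + eps)
    )
    return action_loss + grounding_weight * grounding_loss
```

The intuition is that forcing the controller to predict where the anchors are discourages it from ignoring the visual prompts, so the grounding signal shapes the same features used for low-level control.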