VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
arXiv cs.RO / 2026/3/24
Key Points
- Vision-Language-Action (VLA) models often rely on a single “black-box” mapping from image+instruction to control signals, which can hurt spatial precision and robustness in out-of-distribution settings.
- The proposed VP-VLA framework introduces a structured visual prompting interface that decouples high-level planning from low-level control.
- A “System 2 Planner” breaks instructions into sub-tasks, identifies target objects and goal locations, and converts these into spatial anchors (e.g., crosshairs/bounding boxes) overlaid on the observations (see the first sketch after this list).
- A “System 1 Controller,” trained with an auxiliary visual grounding objective, uses these prompts to generate more reliable and precise low-level motions (see the second sketch after this list).
- Experiments on Robocasa-GR1-Tabletop and SimplerEnv show improved success rates (+5% and +8.3%) and better results than several competitive baselines, including QwenOFT and GR00T-N1.6.
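To make the “spatial anchor” idea concrete, here is a minimal sketch of how a planner’s output (a target bounding box and a goal point) could be rendered onto an observation before it is passed to the controller. The function name, anchor format, and colors are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the visual-prompt overlay step.
# Assumptions: anchor format, colors, and function names are illustrative.
from PIL import Image, ImageDraw

def overlay_spatial_anchors(obs: Image.Image, target_box, goal_point,
                            box_color=(255, 0, 0), cross_color=(0, 255, 0)):
    """Draw a bounding box around the target object and a crosshair at the
    goal location on a copy of the observation image."""
    img = obs.copy()
    draw = ImageDraw.Draw(img)

    # Target object: axis-aligned bounding box (x0, y0, x1, y1) in pixels.
    draw.rectangle(target_box, outline=box_color, width=3)

    # Goal location: crosshair centered at (x, y).
    x, y = goal_point
    r = 12
    draw.line([(x - r, y), (x + r, y)], fill=cross_color, width=3)
    draw.line([(x, y - r), (x, y + r)], fill=cross_color, width=3)
    return img

# Example: the controller would then consume the prompted image instead of the raw one.
# prompted = overlay_spatial_anchors(obs, target_box=(120, 80, 200, 160), goal_point=(320, 240))
```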

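Similarly, a rough sketch of what an auxiliary visual grounding objective could look like during controller training: the controller regresses low-level actions and, as a side task, predicts the prompted anchor coordinates. The loss composition, weighting, and tensor shapes below are assumptions for illustration, not the paper’s exact formulation.

```python
# Minimal sketch of an action loss combined with an auxiliary grounding loss.
# Assumptions: loss choices, weighting, and shapes are illustrative only.
import torch
import torch.nn.functional as F

def controller_loss(pred_actions, gt_actions, pred_anchor_xy, gt_anchor_xy,
                    grounding_weight=0.1):
    """Action regression plus an auxiliary grounding term that pushes the
    controller to localize the prompted target/goal anchors."""
    action_loss = F.mse_loss(pred_actions, gt_actions)         # low-level control
    grounding_loss = F.l1_loss(pred_anchor_xy, gt_anchor_xy)   # localize the anchor
    return action_loss + grounding_weight * grounding_loss

# Example with dummy tensors (batch of 4, 7-DoF actions, 2-D anchor coordinates).
loss = controller_loss(torch.randn(4, 7), torch.randn(4, 7),
                       torch.rand(4, 2), torch.rand(4, 2))
```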