StageCraft: Execution Aware Mitigation of Distractor and Obstruction Failures in VLA Models

arXiv cs.RO / 3/24/2026

Key Points

The paper highlights that Vision Language Action (VLA) models can fail during execution when distractors or physical obstructions appear in the robot workspace, especially in unseen settings.
It proposes StageCraft, a training-free, plug-and-play method that uses the in-context reasoning of a vision-language model (VLM) to modify the environment's initial state and prevent anticipated execution failures.
StageCraft takes policy rollout videos and success/failure labels as input and infers which objects in the initial state should be manipulated to avoid obstruction- or distractor-related failures.
Experiments report a 40% absolute performance improvement across three real-world task domains with diverse distractors and obstructions. In RLBench simulation, StageCraft adapts its extent of intervention to the strength of the underlying policy, and its performance improves with more in-context samples.
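
As a rough illustration of the input described above (rollout videos plus success/failure labels used as in-context examples), the sketch below formats a few labeled rollouts into a text prompt. Everything in it is an assumption made for illustration: the `Rollout` record, the `build_in_context_prompt` helper, the file names, and the prompt wording are hypothetical and not taken from the paper, and a real system would presumably pass video frames to the VLM rather than file paths.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One past policy rollout (hypothetical record, not the paper's schema)."""
    video_path: str            # path to the recorded rollout video
    scene_objects: List[str]   # objects visible in the initial state
    success: bool              # success/failure label for this rollout


def build_in_context_prompt(rollouts: List[Rollout], new_scene: List[str]) -> str:
    """Format labeled rollouts as in-context examples, then pose the new scene."""
    lines = ["Past rollouts of the policy:"]
    for i, r in enumerate(rollouts, start=1):
        label = "SUCCESS" if r.success else "FAILURE"
        objects = ", ".join(r.scene_objects)
        lines.append(f"{i}. video={r.video_path} objects={objects} -> {label}")
    lines.append(f"New initial state objects: {', '.join(new_scene)}")
    lines.append("Which objects should be moved or removed before the policy runs?")
    return "\n".join(lines)


# Two made-up rollouts: the one containing a distractor is labeled as a failure.
rollouts = [
    Rollout("rollout_01.mp4", ["mug", "sponge"], True),
    Rollout("rollout_02.mp4", ["mug", "sponge", "red ball"], False),
]
prompt = build_in_context_prompt(rollouts, ["mug", "sponge", "red ball", "box"])
print(prompt)
```

Adding more entries to `rollouts` lengthens the in-context section of the prompt, which is the knob the last key point refers to when it says performance improves with more in-context samples.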

Abstract

Large-scale pretraining on text and image data, along with diverse robot demonstrations, has helped Vision Language Action models (VLAs) generalize to novel tasks, objects and scenes. However, these models are still susceptible to failure in the presence of execution-time impediments such as distractors and physical obstructions in the robot's workspace. Existing policy improvement methods fine-tune base VLAs to improve generalization, yet they still struggle in unseen distractor settings. To address this problem, we investigate whether internet-scale pretraining of large vision-language models (VLMs) can be leveraged to reason about these impediments and mitigate policy failures. To this end, we propose StageCraft, a training-free approach to improve pretrained VLA policy performance by manipulating the environment's initial state using VLM-based in-context reasoning. StageCraft takes policy rollout videos and success labels as input and leverages the VLM's reasoning ability to infer which objects in the initial state need to be manipulated to avoid anticipated execution failures. StageCraft is an extensible plug-and-play module that does not introduce additional constraints on the underlying policy, and only requires a few policy rollouts to work. We evaluate the performance of state-of-the-art VLA models with StageCraft and show an absolute 40% performance improvement across three real-world task domains involving diverse distractors and obstructions. Our simulation experiments in RLBench empirically show that StageCraft tailors its extent of intervention based on the strength of the underlying policy and improves its performance with more in-context samples. Videos of StageCraft in effect can be found at https://stagecraft-decorator.github.io/stagecraft/.
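
To make the "plug-and-play" idea in the abstract concrete, here is a minimal sketch of wrapping an unmodified policy so that the initial scene is staged before execution. It is an assumption-laden toy, not the authors' implementation: `with_staging`, `toy_policy`, and `toy_planner` are hypothetical names, the scene is reduced to a set of object names, and removing an item from the set stands in for however the objects are actually moved in the real workspace.

```python
from typing import Callable, List, Set

Scene = Set[str]                         # toy scene: names of objects on the table
Policy = Callable[[Scene], bool]         # returns True if the rollout succeeds
Planner = Callable[[Scene], List[str]]   # returns objects to clear before execution


def with_staging(policy: Policy, planner: Planner) -> Policy:
    """Wrap a policy so the scene is staged first; the policy itself is untouched."""
    def staged_policy(scene: Scene) -> bool:
        to_remove = set(planner(scene))
        staged_scene = scene - to_remove   # stand-in for physically moving objects
        return policy(staged_scene)
    return staged_policy


# Toy stand-ins (hypothetical): the policy fails whenever a "red ball" is present,
# and the planner plays the role of the VLM's recommendation.
def toy_policy(scene: Scene) -> bool:
    return "red ball" not in scene


def toy_planner(scene: Scene) -> List[str]:
    return [obj for obj in scene if obj == "red ball"]


scene = {"mug", "sponge", "red ball"}
baseline_ok = toy_policy(scene)                            # distractor present
staged_ok = with_staging(toy_policy, toy_planner)(scene)   # distractor cleared first
print(baseline_ok, staged_ok)  # prints: False True
```

The wrapper only requires that the policy be callable on a scene, which mirrors the abstract's claim that no additional constraints are placed on the underlying policy.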