Scaling Cross-Environment Failure Reasoning Data for Vision-Language Robotic Manipulation

arXiv cs.RO / 4/1/2026


Key Points

  • The paper proposes an automatic framework to scale diverse robotic manipulation failure cases across simulation and real-world settings by perturbing successful trajectories to match realistic failure distributions.
  • It introduces FailCoT, a large-scale failure reasoning dataset generated using a vision-language model to create structured step-by-step reasoning traces, built from RLBench and BridgeDataV2.
  • Using FailCoT, the authors train Guardian, a multi-view reasoning VLM designed to unify planning and execution verification for robust failure detection and recovery.
  • Guardian achieves state-of-the-art results on three unseen real-world benchmarks: RoboFail, RoboVQA, and the newly introduced UR5-Fail.
  • When combined with an LLM-based manipulation policy, Guardian reliably improves task success rates in both simulation and real-world deployments, highlighting the importance of high-quality failure reasoning data for generalization.
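The core data-generation idea, perturbing successful trajectories so the result matches realistic failure modes, can be sketched in a toy form. The function below is purely illustrative (its name, parameters, and the two perturbation types, waypoint jitter and premature truncation, are my assumptions, not the paper's actual implementation):

```python
import random

def perturb_trajectory(waypoints, noise=0.05, drop_prob=0.3, seed=None):
    """Toy sketch: synthesize a plausible execution failure from a
    successful trajectory of (x, y, z) waypoints.

    Two illustrative failure modes:
      1. jitter every waypoint (imprecise motion / missed grasp),
      2. occasionally truncate the tail (premature stop).
    """
    rng = random.Random(seed)
    # Jitter each coordinate uniformly within +/- noise.
    perturbed = [
        tuple(coord + rng.uniform(-noise, noise) for coord in wp)
        for wp in waypoints
    ]
    # Truncating the tail mimics an incomplete execution.
    if rng.random() < drop_prob and len(perturbed) > 2:
        perturbed = perturbed[: len(perturbed) // 2]
    return perturbed

success = [(0.0, 0.0, 0.2), (0.1, 0.0, 0.1), (0.1, 0.1, 0.0)]
failure = perturb_trajectory(success, seed=0)
```

In the paper's pipeline, each such perturbed rollout would then be paired with a VLM-generated step-by-step reasoning trace explaining why it fails; the sketch above only covers the trajectory-perturbation step.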

Abstract

Robust robotic manipulation requires reliable failure detection and recovery. Although recent Vision-Language Models (VLMs) show promise in robot failure detection, their generalization is severely limited by the scarcity and narrow coverage of failure data. To address this bottleneck, we propose an automatic framework for generating diverse robotic planning and execution failures across both simulated and real-world environments. Our approach perturbs successful manipulation trajectories to synthesize failures that reflect realistic failure distributions, and leverages VLMs to produce structured step-by-step reasoning traces. This yields FailCoT, a large-scale failure reasoning dataset built upon the RLBench simulator and the BridgeDataV2 real-robot dataset. Using FailCoT, we train Guardian, a multi-view reasoning VLM for unified planning and execution verification. Guardian achieves state-of-the-art performance on three unseen real-world benchmarks: RoboFail, RoboVQA, and our newly introduced UR5-Fail. When integrated with a state-of-the-art LLM-based manipulation policy, it consistently boosts task success rates in both simulation and real-world deployment. These results demonstrate that scaling high-quality failure reasoning data is critical for improving generalization in robotic failure detection. Code, Data, and Models available at https://www.di.ens.fr/willow/research/guardian/.
