V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation

arXiv cs.RO / 4/13/2026


Key Points

  • V-CAGE is an agentic framework for autonomous robotic data synthesis aimed at scaling Vision-Language-Action (VLA) training while keeping generated scenes both semantically coherent and physically reachable.
  • It uses inpainting-guided scene construction to create context-aware, layout-structured environments, reducing task failures caused by unreachable target positions.
  • The system pairs functional metadata with a closed-loop vision-language-model “visual critic” that verifies trajectory correctness and filters silent failures before they propagate (see the sketch after this list).
  • To address the storage cost of massive video datasets, V-CAGE introduces a perceptually driven compression method that reportedly reduces file sizes by over 90% while preserving downstream VLA training effectiveness.
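
The visual critic is essentially a gate between generation and training: a rollout only enters the dataset if a VLM, shown the rendered frames alongside the episode's functional metadata, confirms the task was actually completed. Below is a minimal sketch of that filtering loop; `query_vlm`, the `Episode` fields, and the prompt wording are hypothetical placeholders standing in for whatever VLM interface and metadata schema V-CAGE actually uses.

```python
"""Minimal sketch of a VLM "visual critic" filter. `query_vlm` and the
Episode/metadata fields are illustrative placeholders, not V-CAGE's API."""

from dataclasses import dataclass


@dataclass
class Episode:
    frames: list     # rendered keyframes of the rollout
    task: str        # e.g. "place the mug on the shelf"
    metadata: dict   # functional metadata: target pose, success criteria, ...


def query_vlm(images: list, prompt: str) -> str:
    """Placeholder for any chat-style vision-language model call (hypothetical)."""
    raise NotImplementedError


def critic_accepts(ep: Episode) -> bool:
    """Ask the VLM to check the trajectory against the task description.

    A silent failure (e.g. the gripper closes on air while the planner
    reports success) looks fine to a scripted checker but not in pixels.
    """
    prompt = (
        f"Task: {ep.task}. "
        f"Success criteria: {ep.metadata.get('success_criteria')}. "
        "Based on the frames, did the robot actually complete the task? "
        "Answer YES or NO."
    )
    verdict = query_vlm(ep.frames, prompt)
    return verdict.strip().upper().startswith("YES")


def filter_dataset(episodes: list) -> list:
    # Keep only episodes the critic verifies, severing error propagation
    # before failed rollouts can contaminate the training set.
    return [ep for ep in episodes if critic_accepts(ep)]
```

Forcing the accept/reject decision through rendered frames rather than simulator state flags is what lets this kind of check catch failures the scripted pipeline would report as successes.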

Abstract

Scaling Vision-Language-Action (VLA) models requires massive datasets that are both semantically coherent and physically feasible. However, existing scene generation methods often lack context-awareness, making it difficult to synthesize high-fidelity environments embedded with rich semantic information and frequently resulting in unreachable target positions that cause tasks to fail prematurely. We present V-CAGE (Vision-Closed-loop Agentic Generation Engine), an agentic framework for autonomous robotic data synthesis. Unlike traditional scripted pipelines, V-CAGE operates as an embodied agentic system, leveraging foundation models to bridge high-level semantic reasoning with low-level physical interaction. Specifically, we introduce Inpainting-Guided Scene Construction to systematically arrange context-aware layouts, ensuring that the generated scenes are both semantically structured and kinematically reachable. To ensure trajectory correctness, we integrate functional metadata with a Vision-Language-Model-based closed-loop verification mechanism that acts as a visual critic, rigorously filtering out silent failures and severing the error propagation chain. Finally, to overcome the storage bottleneck of massive video datasets, we implement a perceptually-driven compression algorithm that achieves over 90% file-size reduction without compromising downstream VLA training efficacy. By centralizing semantic layout planning and visual self-verification, V-CAGE automates the end-to-end pipeline, enabling the highly scalable synthesis of diverse, high-quality robotic manipulation datasets.
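
The abstract does not spell out the compression scheme beyond "perceptually-driven," but one standard way to realize the idea is a quality-gated encode ladder: re-encode each video ever more aggressively and keep the smallest result that still clears a perceptual-similarity floor. The sketch below assumes ffmpeg with libx264 is available; the CRF ladder and the 0.95 SSIM threshold are illustrative choices, not values from the paper.

```python
"""Sketch of a perceptually gated compression loop using ffmpeg/libx264.
The CRF values and SSIM floor are assumptions, not the paper's settings."""

import re
import subprocess
from pathlib import Path

SSIM_FLOOR = 0.95  # assumed perceptual-quality floor; tune per VLA pipeline


def ssim_against_reference(reference: Path, candidate: Path) -> float:
    """Mean SSIM of candidate vs. reference, via ffmpeg's ssim filter."""
    proc = subprocess.run(
        ["ffmpeg", "-i", str(candidate), "-i", str(reference),
         "-lavfi", "ssim", "-f", "null", "-"],
        capture_output=True, text=True, check=True,
    )
    # ffmpeg prints a summary like "SSIM Y:... U:... V:... All:0.9876 (19.06)"
    match = re.search(r"All:([0-9.]+)", proc.stderr)
    return float(match.group(1)) if match else 0.0


def compress(reference: Path, out_dir: Path) -> Path:
    """Walk a CRF ladder from mild to aggressive; return the smallest
    encode whose SSIM against the original stays above the floor."""
    out_dir.mkdir(parents=True, exist_ok=True)
    best = reference
    for crf in (23, 28, 33, 38):  # higher CRF = smaller file, lower fidelity
        candidate = out_dir / f"{reference.stem}_crf{crf}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(reference),
             "-c:v", "libx264", "-crf", str(crf), str(candidate)],
            capture_output=True, check=True,
        )
        if ssim_against_reference(reference, candidate) < SSIM_FLOOR:
            break  # quality floor crossed; stop descending the ladder
        best = candidate
    return best
```

In a pipeline of this shape, the encode step would run per episode after the visual critic accepts it, so storage is only spent on verified data.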