Judge, Then Drive: A Critic-Centric Vision Language Action Framework for Autonomous Driving

arXiv cs.CV / 5/1/2026

Key Points

The paper introduces CriticVLA, a two-stage vision-language-action (VLA) framework for autonomous driving that uses the model as a critic of its own driving decisions instead of only mapping inputs to actions.
CriticVLA first proposes a rough trajectory and then refines it via multimodal evaluation and single-step optimization guided by a VLA-based critic, improving closed-loop decision quality.
To strengthen the critic’s reasoning, the authors build a large synthetic dataset with 12.9 million annotated trajectories across diverse driving scenarios.
Closed-loop experiments on the Bench2Drive benchmark show that CriticVLA outperforms existing state-of-the-art methods, reaching a 73.33% total success rate and about a 30% improvement in challenging scenarios.

Abstract

Recent advances in vision language action (VLA) models have shown remarkable potential for autonomous driving by directly mapping multimodal inputs to control signals. However, previous VLA-based methods have not explicitly exploited the critic capability of VLAs to refine driving decisions, even though such capability has been well demonstrated in other LLM-based domains, thereby limiting their performance in complex closed-loop scenarios. In this work, we present a theoretically inspired two-stage framework, CriticVLA, which extends the role of VLAs from acting to judging. CriticVLA first generates a rough trajectory and then refines it through multimodal evaluation and single-step optimization guided by a VLA-based critic, yielding higher-quality driving behaviors. To support this process, we construct a large-scale synthetic dataset of 12.9 million annotated trajectories covering diverse driving scenarios, which enhances the critic's reasoning and refinement abilities. Extensive closed-loop experiments on the Bench2Drive benchmark show that CriticVLA significantly surpasses state-of-the-art baselines, achieving a 73.33% total success rate and delivering about 30% improvement in challenging scenarios.
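
The propose-then-refine loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the straight-line proposer, the hand-written obstacle-distance critic, the finite-difference gradient, and every name and parameter below (propose_trajectory, critic_score, refine_once, safe_radius, step) are assumptions made for illustration. In CriticVLA the critic is a VLA model evaluating multimodal inputs; here a simple geometric score stands in for it so the single refinement step can be run end to end.

```python
import numpy as np

def propose_trajectory(start, goal, n_points=6):
    """Stage 1 stand-in (hypothetical): straight-line waypoints from start to goal."""
    return np.linspace(np.asarray(start, float), np.asarray(goal, float), n_points)

def critic_score(traj, obstacle, safe_radius=2.0):
    """Toy critic (hypothetical): higher is better.

    Penalizes each waypoint that lies closer than safe_radius to the obstacle.
    """
    dists = np.linalg.norm(traj - obstacle, axis=1)
    penalty = np.clip(safe_radius - dists, 0.0, None) ** 2
    return -float(penalty.sum())

def refine_once(traj, obstacle, step=0.5, eps=1e-4):
    """Stage 2 stand-in: one gradient-ascent step on the critic score.

    The gradient is estimated with central finite differences.
    The first and last waypoints are kept fixed.
    """
    grad = np.zeros_like(traj)
    for i in range(1, len(traj) - 1):
        for j in range(traj.shape[1]):
            up = traj.copy()
            up[i, j] += eps
            down = traj.copy()
            down[i, j] -= eps
            grad[i, j] = (critic_score(up, obstacle) - critic_score(down, obstacle)) / (2 * eps)
    return traj + step * grad

# Usage: a rough path that passes too close to an obstacle, then one refinement step.
obstacle = np.array([5.0, 0.5])
rough = propose_trajectory((0.0, 0.0), (10.0, 0.0))
refined = refine_once(rough, obstacle)
print("critic score before:", critic_score(rough, obstacle))
print("critic score after: ", critic_score(refined, obstacle))
```

In this toy setup a single step pushes the two waypoints nearest the obstacle outward, so the critic score improves while the endpoints stay put. The design point it mirrors is only the division of labor: a cheap proposer produces a rough trajectory, and a separate scoring function drives one corrective update.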