
Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model

arXiv cs.CV / 3/20/2026


Key Points

  • Action Draft and Verify (ADV) presents a self-verifying framework for Vision-Language-Action models that combines diffusion-based action drafting with a verification step.
  • ADV drafts multiple candidate action chunks using a diffusion action expert and ranks them via a perplexity-style metric in a single forward pass of the vision-language model.
  • When trained with matched backbones, data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in real-world settings over diffusion-based baselines, with only a single-pass VLM reranking overhead.
  • By integrating diffusion-based and auto-regressive priors, ADV aims to enhance robustness and generalization for embodied tasks in out-of-distribution environments.

Abstract

Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in real-world settings over a diffusion-based baseline, with a single-pass VLM reranking overhead.
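The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says the metric is "perplexity-style", so the code below assumes the standard definition (the exponential of the negative mean token log-probability) and assumes each drafted action chunk has already been scored by the VLM as a sequence of per-token log-probabilities. The function names and the numbers in `drafts` are hypothetical, chosen only for illustration. In the paper's setting all candidates are scored in one forward pass; here those scores are simply taken as given.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean log-probability.

    Assumption: the summary does not define its "perplexity-style"
    metric, so the textbook formula is used here as a stand-in.
    """
    if not token_logprobs:
        raise ValueError("candidate has no tokens to score")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_candidate(candidate_logprobs: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate with the lowest perplexity."""
    if not candidate_logprobs:
        raise ValueError("no candidates to choose from")
    scores = [perplexity(lp) for lp in candidate_logprobs]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-token log-probabilities for three drafted action
# chunks (made-up values, not results from the paper).
drafts = [
    [-1.2, -0.9, -1.5],
    [-0.3, -0.4, -0.2],
    [-2.0, -0.1, -0.8],
]

best = select_candidate(drafts)
print(f"selected draft index: {best}")
```

Lower perplexity means the VLM finds the candidate more plausible, so the sketch picks the minimum. Averaging over tokens rather than summing keeps candidates of different tokenized lengths comparable, although under the matched action-chunk length mentioned above that distinction may not matter in practice.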