
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

arXiv cs.AI / 3/12/2026

📰 News · Models & Research

Key Points

  • AR-VLA introduces a standalone autoregressive Action Expert that generates actions as a continuous causal sequence with a long-lived memory, improving context-awareness over existing vision-language-action models.
  • It features a re-anchoring mechanism to account for perception staleness and to synchronize asynchronous vision-language-action modalities during training and inference.
  • Experiments on simulated and real-robot manipulation tasks show AR-VLA can replace chunk-based action heads while delivering smoother trajectories and comparable or higher task success than state-of-the-art reactive VLAs.
  • The approach enables independent pretraining of kinematic syntax and modular integration with heavy perception backbones, addressing the fast control/slow reasoning frequency mismatch in robotics policies.
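The paper does not include reference code; the contrast the key points draw between a reactive chunk-based action head (context reset on every new observation) and an autoregressive action expert with a long-lived memory can be illustrated with a minimal, purely hypothetical sketch. All class and method names below are invented for illustration and are not from AR-VLA.

```python
from collections import deque

class ChunkActionHead:
    """Reactive baseline (illustrative): each new observation resets
    temporal context and emits a fixed-size action chunk."""
    def predict(self, observation, chunk_size=4):
        # Stateless: no memory survives between calls.
        return [f"a({observation},t={t})" for t in range(chunk_size)]

class AutoregressiveActionExpert:
    """Hypothetical sketch of an AR action expert: actions form one
    continuous causal sequence, conditioned on a refreshable
    vision-language prefix AND on a persistent action history."""
    def __init__(self, memory_limit=64):
        # Long-lived memory of past actions, retained across frames.
        self.memory = deque(maxlen=memory_limit)

    def step(self, observation):
        # Each action depends on the current prefix and the entire
        # retained history, so generation is context-aware.
        action = f"a(hist={len(self.memory)},obs={observation})"
        self.memory.append(action)
        return action
```

In this toy form, calling `ChunkActionHead.predict` twice with the same observation yields identical chunks, while `AutoregressiveActionExpert.step` produces history-dependent actions, which is the structural property the key points attribute to AR-VLA.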

Abstract

We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies.
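The abstract describes a re-anchoring mechanism that accounts for perception staleness when synchronizing slow vision-language prefixes with the fast action stream, but gives no implementation details. One plausible reading is that a stale prefix must be paired with the actions executed since its capture time before the expert conditions on it. The sketch below illustrates that bookkeeping only; the class, its methods, and the dictionary layout are assumptions, not the paper's method.

```python
class ReAnchoringBuffer:
    """Hypothetical bookkeeping for re-anchoring (not from the paper):
    a slow perception backbone emits prefixes stamped with their
    capture time, while the fast control loop logs every executed
    action. A stale prefix is re-anchored by attaching the actions
    executed after it was captured, keeping the causal action
    sequence aligned with perception."""
    def __init__(self):
        self.action_log = []  # list of (timestamp, action)

    def record_action(self, timestamp, action):
        self.action_log.append((timestamp, action))

    def reanchor(self, prefix, capture_time, now):
        staleness = now - capture_time
        # Actions the robot executed while this observation was in
        # flight; the expert must bridge over them.
        bridge = [a for (t, a) in self.action_log if t > capture_time]
        return {"prefix": prefix,
                "staleness": staleness,
                "bridge_actions": bridge}
```

The same structure would apply at training time: given logged timestamps, each vision-language prefix can be paired with the action subsequence that post-dates it, simulating inference-time staleness.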