DriveVA: Video Action Models are Zero-Shot Drivers

arXiv cs.RO / 4/7/2026


Key Points

  • DriveVA is an autonomous-driving world model designed to improve generalization under unseen scenarios, sensor domains, and environmental conditions by jointly modeling future video forecasts and action/trajectory sequences.
  • The approach addresses limitations of prior world-model planners by using a shared latent generative process, with a DiT-based decoder to better align visual imagination with planned actions and improve video–trajectory consistency.
  • DriveVA leverages priors from large-scale pretrained video generation models to capture continuous spatiotemporal evolution and physically plausible motion dynamics.
  • A “video continuation” rollout strategy is introduced to strengthen long-duration closed-loop prediction consistency.
  • In experiments, DriveVA reports strong closed-loop performance (90.9 PDM score on NAVSIM) and large improvements over the state-of-the-art world-model-based planner on nuScenes and on Bench2Drive (CARLA v2), including reported reductions in L2 error and collision rate, alongside zero-shot and cross-domain generalization results.
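The paper does not release implementation details beyond the points above, but the core idea of the second bullet — video latents and action tokens denoised together in one shared generative process so attention flows between "imagined" frames and planned waypoints — can be illustrated with a toy, single-layer sketch. All names, shapes, and the simplified attention update below are assumptions for illustration, not DriveVA's actual DiT architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_denoise_step(video_tokens, action_tokens, w_qkv):
    """One toy denoising step over the concatenated video+action sequence.

    video_tokens:  (Tv, d) noisy future-video latent tokens
    action_tokens: (Ta, d) noisy trajectory/action tokens
    w_qkv:         (d, 3*d) shared QKV projection (stand-in for a DiT block)
    """
    d = video_tokens.shape[1]
    # A single shared sequence: video and action tokens are denoised jointly,
    # so self-attention lets planned actions attend to imagined frames and
    # vice versa (the "shared latent generative process" of the summary).
    x = np.concatenate([video_tokens, action_tokens], axis=0)
    q, k, v = np.split(x @ w_qkv, 3, axis=1)
    out = x + softmax(q @ k.T / np.sqrt(d)) @ v  # residual attention update
    return out[: len(video_tokens)], out[len(video_tokens):]

# Toy usage with made-up sizes: 16 video-latent tokens, 4 waypoint tokens.
rng = np.random.default_rng(0)
d = 8
vid = rng.normal(size=(16, d))
act = rng.normal(size=(4, d))
w = rng.normal(size=(d, 3 * d)) * 0.1
vid_out, act_out = joint_denoise_step(vid, act, w)
print(vid_out.shape, act_out.shape)  # (16, 8) (4, 8)
```

In a real DiT decoder this step would be stacked many times with timestep conditioning and learned weights; the point of the sketch is only that both modalities share one attention sequence, which is what ties trajectory prediction to scene evolution.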

Abstract

Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video-trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from well-pretrained large-scale video generation models to capture continuous spatiotemporal evolution and causal interaction patterns. To this end, DriveVA employs a DiT-based decoder to jointly predict future action sequences (trajectories) and videos, enabling tighter alignment between planning and scene evolution. We also introduce a video continuation strategy to strengthen long-duration rollout consistency. DriveVA achieves an impressive closed-loop performance of 90.9 PDM score on the challenging NAVSIM benchmark. Extensive experiments also demonstrate the zero-shot capability and cross-domain generalization of DriveVA, which reduces average L2 error and collision rate by 78.9% and 83.3% on nuScenes, and by 52.5% and 52.4% on the Bench2Drive benchmark built on CARLA v2, compared with the state-of-the-art world-model-based planner.
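The "video continuation" strategy mentioned in the abstract is described only at a high level, but the general pattern — conditioning each newly generated chunk on the tail of the video produced so far, rather than only on the original observation — can be sketched as follows. The function names, chunk size, and the dummy predictor are all hypothetical; this is a minimal sketch of the rollout loop, not DriveVA's implementation.

```python
import numpy as np

def rollout_with_continuation(init_frames, predict_chunk, n_chunks, ctx_len):
    """Long-horizon closed-loop rollout via video continuation.

    Each iteration re-conditions the model on the most recent ctx_len
    frames of the video generated so far, so successive chunks stay
    consistent with each other over long durations.
    """
    video = list(init_frames)
    actions = []
    for _ in range(n_chunks):
        ctx = np.stack(video[-ctx_len:])      # tail of generated video as context
        frames, acts = predict_chunk(ctx)     # joint video+action prediction
        video.extend(frames)
        actions.extend(acts)
    return np.stack(video), np.stack(actions)

def dummy_predict(ctx):
    """Stand-in predictor: drifts the last context frame; 4 frames + 4 actions."""
    base = ctx[-1]
    frames = [base + 0.1 * (i + 1) for i in range(4)]
    acts = [np.zeros(2) for _ in range(4)]
    return frames, acts

# Toy usage: 3 observed 2x2 "frames", 5 chunks of 4 frames each.
init = [np.zeros((2, 2)) for _ in range(3)]
video, actions = rollout_with_continuation(init, dummy_predict,
                                           n_chunks=5, ctx_len=3)
print(video.shape, actions.shape)  # (23, 2, 2) (20, 2)
```

The design choice being illustrated: without continuation, each chunk would be generated from the initial frames alone and chunks could diverge from one another; feeding generated frames back in keeps the rollout self-consistent, at the cost of possible error accumulation.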