DriveVA: Video Action Models are Zero-Shot Drivers
arXiv cs.RO / 4/7/2026
Key Points
- DriveVA is an autonomous-driving world model designed to improve generalization to unseen scenarios, sensor domains, and environmental conditions by jointly modeling future video and action/trajectory sequences.
- The approach addresses limitations of prior world-model planners by using a shared latent generative process, with a DiT-based decoder to better align visual imagination with planned actions and improve video–trajectory consistency.
- DriveVA leverages priors from large-scale pretrained video generation models to capture continuous spatiotemporal evolution and physically plausible motion dynamics.
- A “video continuation” rollout strategy is introduced to strengthen prediction consistency over long closed-loop horizons (see the sketch after this list).
- In experiments, DriveVA reports strong closed-loop performance (a PDM score of 90.9 on NAVSIM) and large improvements over the state of the art on nuScenes and Bench2Drive (CARLA v2), including reported reductions in L2 error and collision rate, alongside zero-shot and cross-domain generalization results.
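To make the joint video-action idea and the “video continuation” rollout more concrete, here is a minimal PyTorch sketch. It is an illustrative assumption of how the pieces could fit together, not DriveVA's actual implementation: the names `JointVideoActionModel` and `video_continuation_rollout`, the linear stand-in for the DiT-based video decoder, the dummy environment, and all tensor shapes are hypothetical.

```python
import torch
import torch.nn as nn


class JointVideoActionModel(nn.Module):
    """Toy joint video-action model: one shared latent sequence feeds both heads."""

    def __init__(self, latent_dim=256, video_token_dim=64, action_dim=2):
        super().__init__()
        # Shared latent generative process over fused observation tokens.
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Stand-in for the DiT-based video decoder: the real model would denoise
        # future video latents with a diffusion transformer; here it is a linear head.
        self.video_head = nn.Linear(latent_dim, video_token_dim)
        # Trajectory head predicting (x, y) waypoints, step-aligned with the video tokens.
        self.action_head = nn.Linear(latent_dim, action_dim)

    def forward(self, obs_tokens):
        z = self.backbone(obs_tokens)         # (B, T, latent_dim) shared latents
        return self.video_head(z), self.action_head(z)


def video_continuation_rollout(model, step_env, init_context, steps=10, context_len=8):
    """Closed-loop rollout that keeps extending the model's own video/latent context."""
    context = init_context                    # (1, T, latent_dim) running history
    executed = []
    for _ in range(steps):
        # Imagine future video tokens and plan waypoints from the running context.
        _video_tokens, trajectory = model(context[:, -context_len:, :])
        waypoint = trajectory[:, -1:, :]      # execute only the next planned step
        executed.append(waypoint)
        new_obs = step_env(waypoint)          # (1, 1, latent_dim) fresh observation tokens
        # "Video continuation": append the newest tokens so each prediction is
        # conditioned on a temporally consistent, ever-growing history.
        context = torch.cat([context, new_obs], dim=1)
    return torch.cat(executed, dim=1)         # (1, steps, action_dim)


if __name__ == "__main__":
    model = JointVideoActionModel()
    # Dummy simulator: returns random observation tokens regardless of the action taken.
    dummy_env = lambda action: torch.randn(1, 1, 256)
    init_context = torch.randn(1, 8, 256)
    traj = video_continuation_rollout(model, dummy_env, init_context)
    print(traj.shape)  # torch.Size([1, 10, 2])
```

The design point the sketch tries to capture is that the trajectory head and the video head read the same latent sequence, so the planned waypoints stay aligned with the imagined future, and the closed-loop rollout keeps conditioning on a continuously extended context rather than restarting from each new frame.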