DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
arXiv cs.CV / 4/2/2026
Key Points
- The paper introduces a Vision-Geometry-Action (VGA) paradigm for autonomous driving that treats dense 3D geometry as the primary cue for decision-making, rather than the sparse perception or language-augmented planning used in VLA models.
- It proposes DVGT-2, a streaming Driving Visual Geometry Transformer that, at each time step, outputs dense geometry and a planned trajectory for the current frame, enabling online inference.
- DVGT-2 achieves real-time operation through temporal causal attention, caching of historical features, and a sliding-window streaming strategy that avoids recomputing past frames.
- The method reports improved dense geometry reconstruction while running faster than prior approaches across multiple datasets.
- A key claim is transferability: the same trained DVGT-2 can be applied to planning across different camera configurations without fine-tuning, validated on closed-loop NAVSIM and open-loop nuScenes benchmarks.
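The streaming strategy described above can be illustrated with a minimal sketch. This is a toy illustration in plain NumPy, not the paper's actual implementation: all class and variable names are hypothetical, and a real transformer would cache key/value projections rather than raw frame features. The idea shown is that each incoming frame attends causally over a fixed-size window of cached per-frame features, so past frames are never re-encoded and memory stays bounded.

```python
from collections import deque
import numpy as np

class SlidingWindowAttention:
    """Toy sliding-window causal attention over cached frame features.

    Hypothetical sketch: a fixed-size deque plays the role of the
    historical feature cache; old frames are evicted automatically.
    """

    def __init__(self, window: int, dim: int):
        self.dim = dim
        # deque(maxlen=...) drops the oldest frame once the window is full
        self.cache = deque(maxlen=window)

    def step(self, frame_feat: np.ndarray) -> np.ndarray:
        # Causal: the new frame attends only to itself and past frames.
        self.cache.append(frame_feat)
        keys = np.stack(self.cache)                  # (t, dim), t <= window
        scores = keys @ frame_feat / np.sqrt(self.dim)
        weights = np.exp(scores - scores.max())      # stable softmax
        weights /= weights.sum()
        return weights @ keys                        # attended feature

rng = np.random.default_rng(0)
attn = SlidingWindowAttention(window=4, dim=8)
outs = [attn.step(rng.standard_normal(8)) for _ in range(6)]
print(len(attn.cache))  # → 4: the cache never exceeds the window size
```

Because each step touches only the cached window rather than the full history, per-frame cost is constant in sequence length, which is what makes this kind of streaming design attractive for online driving inference.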