VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model
arXiv cs.CV / 3/16/2026
Key Points
- VGGT-World introduces a geometry-first world model that forecasts scene evolution by predicting future geometry features instead of generating photorealistic video frames.
- It repurposes the latent tokens of a frozen VGGT as the world state and trains a lightweight temporal flow transformer to autoregressively predict their future trajectory.
- To address the high-dimensional feature space (d=1024), the paper employs a clean-target z-prediction parameterization and a two-stage latent flow-forcing curriculum to mitigate velocity-prediction collapse and exposure bias.
- Experiments on KITTI, Cityscapes, and TartanAir show that VGGT-World significantly outperforms strong baselines in depth forecasting while running 3.6–5x faster with only 0.43B trainable parameters. The results indicate that frozen geometry foundation model (GFM) features are an effective predictive state for 3D world modeling.
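
The autoregressive, clean-target prediction described in the key points can be illustrated with a small sketch. This is not the paper's implementation. It assumes a linear flow path x_t = (1 - t) * noise + t * z with t running from 0 (noise) to 1 (clean latent), under which a clean-target prediction z_hat implies the velocity (z_hat - x_t) / (1 - t). All function names are hypothetical, `toy_predict_z` is a stand-in for the trained temporal flow transformer, and the latent dimension is 2 rather than 1024 to keep the example readable.

```python
import random


def velocity_from_z(x_t, z_hat, t, eps=1e-3):
    """Convert a clean-target prediction into a flow velocity.

    Assumes the linear path x_t = (1 - t) * noise + t * z, whose velocity
    is z - noise = (z - x_t) / (1 - t). eps guards the division near t = 1.
    """
    scale = 1.0 / max(1.0 - t, eps)
    return [(z - x) * scale for x, z in zip(x_t, z_hat)]


def sample_next_state(predict_z, context, dim, steps=8, rng=None):
    """Euler-integrate from Gaussian noise (t=0) toward the next latent (t=1)."""
    rng = rng or random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        z_hat = predict_z(x, t, context)      # model outputs the clean target
        v = velocity_from_z(x, z_hat, t)      # velocity is derived, not predicted
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def rollout(predict_z, history, horizon, dim, steps=8):
    """Autoregressive rollout: each sampled state is appended to the context."""
    context = [list(state) for state in history]
    rng = random.Random(0)
    for _ in range(horizon):
        context.append(sample_next_state(predict_z, context, dim, steps, rng))
    return context[len(history):]


def toy_predict_z(x_t, t, context):
    """Hypothetical stand-in model: linear extrapolation of the last two states."""
    prev, last = context[-2], context[-1]
    return [2.0 * b - a for a, b in zip(prev, last)]


if __name__ == "__main__":
    past = [[0.0, 0.0], [1.0, 2.0]]           # two observed latent states
    for state in rollout(toy_predict_z, past, horizon=3, dim=2):
        print([round(v, 6) for v in state])
```

In this sketch the network output stays in the space of clean latents, and the velocity used by the sampler is computed from it. Feeding each sampled state back into `context` is also where exposure bias enters: at inference the model conditions on its own predictions rather than on ground-truth states.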