VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model
arXiv cs.CV / 3/16/2026
Key Points
- VGGT-World introduces a geometry-first world model that forecasts scene evolution by predicting future geometry features instead of generating photorealistic video frames.
- It repurposes the latent tokens of a frozen VGGT as the world state and trains a lightweight temporal flow transformer to autoregressively predict their future trajectory.
- To handle the high-dimensional feature space (d = 1024), the paper employs a clean-target z-prediction parameterization and a two-stage latent flow-forcing curriculum to mitigate velocity-prediction collapse and exposure bias.
- Experiments on KITTI, Cityscapes, and TartanAir show that VGGT-World significantly outperforms strong baselines in depth forecasting while running 3.6–5× faster with only 0.43B trainable parameters. The results indicate that frozen geometry foundation model (GFM) features are an effective predictive state for 3D world modeling.
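
To make the clean-target (z-prediction) idea and the autoregressive rollout more concrete, here is a minimal NumPy sketch. It assumes a standard rectified-flow convention (a straight path from noise at t = 0 to the clean latent at t = 1) and Euler integration; the summary does not spell out the paper's actual formulation, network, or curriculum. All function names, and the toy `predict_z` callable standing in for the temporal flow transformer, are hypothetical.

```python
import numpy as np

def interpolate(noise, z, t):
    """Assumed linear path: pure noise at t=0, clean latent z at t=1."""
    return (1.0 - t) * noise + t * z

def velocity_from_z(z_hat, x_t, t, eps=1e-4):
    """Turn a clean-target prediction z_hat into the velocity an ODE step needs.

    On the linear path the true velocity is z - noise, which equals
    (z - x_t) / (1 - t). The eps clamp guards against division by zero
    as t approaches 1.
    """
    return (z_hat - x_t) / max(1.0 - t, eps)

def sample_next(predict_z, context, dim, steps=8, rng=None):
    """Euler-integrate from noise to one predicted future latent."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(dim)
    for i in range(steps):
        t = i / steps
        z_hat = predict_z(x, t, context)  # model outputs the clean latent
        x = x + velocity_from_z(z_hat, x, t) / steps
    return x

def rollout(predict_z, history, horizon, steps=8):
    """Autoregressive rollout: each predicted latent joins the context."""
    history = list(history)
    rng = np.random.default_rng(0)
    for _ in range(horizon):
        context = np.stack(history)
        nxt = sample_next(predict_z, context, history[-1].shape[0], steps, rng)
        history.append(nxt)
    return np.stack(history)

# Toy usage: a hypothetical "model" whose next latent is the last one plus 1.
# A real setup would use d = 1024 tokens from the frozen VGGT encoder.
toy_model = lambda x, t, ctx: ctx[-1] + 1.0
trajectory = rollout(toy_model, [np.zeros(4)], horizon=3)
```

In this convention the network's regression target stays the clean latent itself at every noise level, and the velocity is derived from it only at sampling time. The sketch does not model the flow-forcing curriculum; it only shows where predicted latents re-enter the context, which is where exposure bias arises.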