VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model

arXiv cs.CV / 3/16/2026



Key Points

VGGT-World introduces a geometry-first world model that forecasts scene evolution by predicting future geometry features instead of generating photorealistic video frames.
It repurposes the latent tokens of a frozen VGGT as the world state and trains a lightweight temporal flow transformer to autoregressively predict their future trajectory.
In this high-dimensional feature space (d=1024), the paper uses a clean-target (z-prediction) parameterization to counter the collapse of velocity-prediction flow matching, and a two-stage latent flow-forcing curriculum to reduce exposure bias during autoregressive rollout.
On KITTI, Cityscapes, and TartanAir, VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6-5 times faster with only 0.43B trainable parameters. The authors take this as evidence that frozen geometry-foundation-model (GFM) features are an effective and efficient predictive state for 3D world modeling.
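
The autoregressive loop described in the key points can be sketched as follows. This is a minimal illustration, not the paper's implementation: `rollout`, `predict_next`, and the toy linear-extrapolation predictor are hypothetical stand-ins for the temporal flow transformer, and the token shapes are shrunk for readability (the paper's feature dimension is d=1024).

```python
import numpy as np

def rollout(context, predict_next, horizon):
    """Autoregressively extend a sequence of latent token frames.

    context: array of shape (T, N, d), i.e. T past frames of N tokens each.
    predict_next: hypothetical callable mapping a (T, N, d) window to the
        next frame of shape (N, d); it stands in for the learned predictor.
    horizon: number of future frames to forecast.
    """
    window = np.array(context, dtype=float)
    futures = []
    for _ in range(horizon):
        nxt = predict_next(window)
        futures.append(nxt)
        # Slide the window: drop the oldest frame and append the prediction,
        # so later steps are conditioned on the model's own outputs.
        window = np.concatenate([window[1:], nxt[None]], axis=0)
    return np.stack(futures, axis=0)

def linear_extrapolation(window):
    """Toy predictor (assumption): continue the last frame-to-frame change."""
    return window[-1] + (window[-1] - window[-2])

# Toy latent sequence whose tokens grow by 1.0 per frame.
T, N, d = 4, 3, 8
context = np.arange(T, dtype=float)[:, None, None] * np.ones((T, N, d))
future = rollout(context, linear_extrapolation, horizon=3)
print(future.shape)     # (3, 3, 8)
print(future[:, 0, 0])  # [4. 5. 6.]
```

Because each step feeds predictions back into the window, any error in `predict_next` compounds over the horizon, which is the exposure-bias problem the paper's flow-forcing curriculum targets.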

Abstract

World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high-dimensional (d=1024) feature space: (i) standard velocity-prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean-target (z-prediction) parameterization that yields a substantially higher signal-to-noise ratio, and the second with a two-stage latent flow-forcing curriculum that progressively conditions the model on its own partially denoised rollouts. Experiments on KITTI, Cityscapes, and TartanAir demonstrate that VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6-5 times faster with only 0.43B trainable parameters, establishing frozen GFM features as an effective and efficient predictive state for 3D world modeling.
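
To make the distinction between velocity prediction and clean-target (z-prediction) concrete, the sketch below assumes a linear interpolation path between Gaussian noise and the clean latent, a common choice in flow matching. The abstract does not state the path or loss the authors use, so the schedule, function names, and random stand-in latents here are assumptions for illustration only.

```python
import numpy as np

def interpolate(z, noise, t):
    """Noisy latent on an assumed linear path: x_t = (1 - t) * noise + t * z."""
    return (1.0 - t) * noise + t * z

def velocity_target(z, noise):
    """Regression target under velocity prediction: dx_t/dt = z - noise."""
    return z - noise

def velocity_from_clean(x_t, z_hat, t):
    """Recover the velocity from a clean-target estimate z_hat.

    On the linear path, z - x_t = (1 - t) * (z - noise), so dividing by
    (1 - t) gives the velocity. Only valid for t < 1.
    """
    return (z_hat - x_t) / (1.0 - t)

rng = np.random.default_rng(0)
d = 1024                        # feature dimension quoted in the abstract
z = rng.standard_normal(d)      # random stand-in for a clean latent token
noise = rng.standard_normal(d)
t = 0.3

x_t = interpolate(z, noise, t)
v_true = velocity_target(z, noise)
# With a perfect clean-target prediction (z_hat == z), both
# parameterizations describe the same velocity field.
v_from_z = velocity_from_clean(x_t, z, t)
print(np.allclose(v_true, v_from_z))  # True
```

Under this assumed path the two parameterizations are interchangeable at sampling time; they differ in what the network regresses during training. A velocity target always contains the full noise term, while a clean target is the latent itself, which is the property the abstract associates with a higher signal-to-noise ratio.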


