Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

arXiv cs.CV / 5/6/2026


Key Points

  • The paper argues that standard video diffusion models trained only on raw videos can learn representations that miss geometry-aware 3D structure, despite videos being 2D projections of a 3D world.
  • It introduces “Geometry Forcing,” a training method that nudges intermediate representations of video diffusion models toward 3D geometry by aligning them with features from a geometric foundation model.
  • Geometry Forcing uses two complementary objectives: Angular Alignment (directional consistency via cosine similarity) and Scale Alignment (scale information preservation via regression of geometric features).
  • Experiments on camera-view-conditioned and action-conditioned video generation show improved visual quality and stronger 3D consistency compared with baseline approaches.
  • The work presents a practical approach for improving world modeling consistency by explicitly injecting geometric constraints into diffusion-based video generation.
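The two alignment objectives described above can be sketched as simple losses. This is a minimal illustration, not the paper's implementation: the exact loss forms, the linear regression head `W`, and the function names are assumptions for clarity. Angular Alignment compares feature directions via cosine similarity; Scale Alignment regresses the (unnormalized) geometric features from normalized diffusion features, so scale information must be recovered rather than discarded.

```python
import numpy as np

def angular_alignment_loss(diff_feats: np.ndarray, geo_feats: np.ndarray) -> float:
    """Directional consistency: 1 - mean cosine similarity between
    diffusion-model features and geometric foundation-model features."""
    d = diff_feats / np.linalg.norm(diff_feats, axis=-1, keepdims=True)
    g = geo_feats / np.linalg.norm(geo_feats, axis=-1, keepdims=True)
    return float(1.0 - np.mean(np.sum(d * g, axis=-1)))

def scale_alignment_loss(diff_feats: np.ndarray, geo_feats: np.ndarray,
                         W: np.ndarray) -> float:
    """Scale preservation: regress the geometric features (with their scale)
    from L2-normalized diffusion features through a hypothetical linear
    head W, penalizing the mean squared error."""
    d = diff_feats / np.linalg.norm(diff_feats, axis=-1, keepdims=True)
    pred = d @ W  # hypothetical regression head; the paper's head may differ
    return float(np.mean((pred - geo_feats) ** 2))
```

Because the diffusion features are normalized before the regression in the second loss, the cosine-based term alone cannot satisfy it: the model's representation has to carry scale-related geometric information, which is the complementarity the paper points to.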

Abstract

Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge the gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera-view conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.