Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

arXiv cs.CV / 5/6/2026


Key Points

  • The paper argues that standard video diffusion models trained only on raw videos can learn representations that miss geometry-aware 3D structure, despite videos being 2D projections of a 3D world.
  • It introduces “Geometry Forcing,” a training method that nudges intermediate representations of video diffusion models toward 3D geometry by aligning them with features from a geometric foundation model.
  • Geometry Forcing uses two complementary objectives: Angular Alignment (directional consistency via cosine similarity) and Scale Alignment (scale information preservation via regression of geometric features).
  • Experiments on camera-view-conditioned and action-conditioned video generation show improved visual quality and stronger 3D consistency compared with baseline approaches.
  • The work presents a practical approach for improving world modeling consistency by explicitly injecting geometric constraints into diffusion-based video generation.
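The two alignment objectives described above can be sketched as simple losses. This is a minimal illustration, not the paper's implementation: the exact loss forms, the linear regression head `W`, and the function names are assumptions for clarity. Angular Alignment compares feature directions via cosine similarity; Scale Alignment regresses the (unnormalized) geometric features from normalized diffusion features, so scale information must be recovered rather than discarded.

```python
import numpy as np

def angular_alignment_loss(diff_feats: np.ndarray, geo_feats: np.ndarray) -> float:
    """Directional consistency: 1 - mean cosine similarity between
    diffusion-model features and geometric foundation-model features."""
    d = diff_feats / np.linalg.norm(diff_feats, axis=-1, keepdims=True)
    g = geo_feats / np.linalg.norm(geo_feats, axis=-1, keepdims=True)
    return float(1.0 - np.mean(np.sum(d * g, axis=-1)))

def scale_alignment_loss(diff_feats: np.ndarray, geo_feats: np.ndarray,
                         W: np.ndarray) -> float:
    """Scale preservation: regress the geometric features (with their scale)
    from L2-normalized diffusion features through a hypothetical linear
    head W, penalizing the mean squared error."""
    d = diff_feats / np.linalg.norm(diff_feats, axis=-1, keepdims=True)
    pred = d @ W  # hypothetical regression head; the paper's head may differ
    return float(np.mean((pred - geo_feats) ** 2))
```

Because the diffusion features are normalized before the regression in the second loss, the cosine-based term alone cannot satisfy it: the model's representation has to carry scale-related geometric information, which is the complementarity the paper points to.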

Abstract

Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge the gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera-view conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.