Motion Forcing: A Decoupled Framework for Robust Video Generation in Motion Dynamics

arXiv cs.CV / 3/12/2026

Key Points

  • The paper identifies a trilemma in video generation—high visual quality, physical consistency, and controllability—that degrades in complex scenes like collisions or dense traffic.
  • It introduces Motion Forcing, a decoupled framework that separates physical reasoning from visual synthesis using a hierarchical Point-Shape-Appearance paradigm.
  • It proposes Masked Point Recovery, a training strategy that masks input anchors and requires the model to reconstruct complete dynamic depth, encouraging learning of latent physical laws such as inertia.
  • Extensive experiments on autonomous driving benchmarks and physics/robotics tasks show that Motion Forcing outperforms state-of-the-art baselines and maintains trilemma stability in challenging scenarios.
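
The hierarchical "Point-Shape-Appearance" decomposition in the key points above can be illustrated as a three-stage pipeline. This is a minimal toy sketch under assumed shapes and stand-in stage logic (constant-velocity point dynamics, nearest-pixel depth splatting, depth-based shading); the paper's actual networks and data formats are not specified here.

```python
import numpy as np

# Hypothetical sketch of the hierarchical "Point-Shape-Appearance" stages.
# Stage names follow the paper; shapes and internals are illustrative assumptions.

def point_stage(num_objects: int, num_frames: int) -> np.ndarray:
    """Point: model dynamics as sparse geometric anchors, one 3D trajectory
    per object over time, shape (T, N, 3). Toy constant-velocity motion."""
    rng = np.random.default_rng(0)
    positions = rng.uniform(-1.0, 1.0, size=(1, num_objects, 3))
    velocities = rng.uniform(-0.1, 0.1, size=(1, num_objects, 3))
    steps = np.arange(num_frames).reshape(-1, 1, 1)
    return positions + steps * velocities

def shape_stage(anchors: np.ndarray, height: int = 32, width: int = 32) -> np.ndarray:
    """Shape: expand sparse anchors into per-frame dynamic depth maps (T, H, W).
    Stand-in: splat each anchor's depth (z) onto its nearest pixel."""
    T, N, _ = anchors.shape
    depth = np.full((T, height, width), np.inf)
    xy = (anchors[..., :2] + 1.0) / 2.0  # map [-1, 1] -> [0, 1]
    cols = np.clip((xy[..., 0] * (width - 1)).astype(int), 0, width - 1)
    rows = np.clip((xy[..., 1] * (height - 1)).astype(int), 0, height - 1)
    for t in range(T):
        for n in range(N):
            z = anchors[t, n, 2]
            depth[t, rows[t, n], cols[t, n]] = min(depth[t, rows[t, n], cols[t, n]], z)
    return depth

def appearance_stage(depth: np.ndarray) -> np.ndarray:
    """Appearance: render textures conditioned on the resolved geometry.
    Stand-in: grayscale shading from normalized depth, shape (T, H, W, 3)."""
    finite = np.where(np.isfinite(depth), depth, 0.0)
    lo, hi = finite.min(), finite.max()
    shade = (finite - lo) / (hi - lo + 1e-8)
    return np.repeat(shade[..., None], 3, axis=-1)

anchors = point_stage(num_objects=5, num_frames=8)   # (8, 5, 3)
depth = shape_stage(anchors)                         # (8, 32, 32)
video = appearance_stage(depth)                      # (8, 32, 32, 3)
```

The point of the staging is that each intermediate product (anchors, depth) is a verifiable artifact, so geometry errors can be caught before any texture is rendered.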

Abstract

The ultimate goal of video generation is to satisfy a fundamental trilemma: achieving high visual quality, maintaining rigorous physical consistency, and enabling precise controllability. While recent models can maintain this balance in simple, isolated scenarios, we observe that this equilibrium is fragile and often breaks down as scene complexity increases (e.g., involving collisions or dense traffic). To address this, we introduce Motion Forcing, a framework designed to stabilize this trilemma even in complex generative tasks. Our key insight is to explicitly decouple physical reasoning from visual synthesis via a hierarchical "Point-Shape-Appearance" paradigm. This approach decomposes generation into verifiable stages: modeling complex dynamics as sparse geometric anchors (Point), expanding them into dynamic depth maps that explicitly resolve 3D geometry (Shape), and finally rendering high-fidelity textures (Appearance). Furthermore, to foster robust physical understanding, we employ a Masked Point Recovery strategy. By randomly masking input anchors during training and enforcing the reconstruction of complete dynamic depth, the model is compelled to move beyond passive pattern matching and learn latent physical laws (e.g., inertia) to infer missing trajectories. Extensive experiments on autonomous driving benchmarks show that Motion Forcing significantly outperforms state-of-the-art baselines, maintaining trilemma stability across complex scenes. Evaluations on physics and robotics tasks further confirm our framework's generality.
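
The Masked Point Recovery strategy described in the abstract can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names, the mask ratio, and the mean-squared-error loss are assumptions; only the core idea, randomly hiding input anchors while supervising reconstruction of the full dynamic depth, comes from the paper.

```python
import numpy as np

# Hypothetical sketch of Masked Point Recovery: hide a random subset of the
# sparse input anchor trajectories, while the training target remains the
# COMPLETE dynamic depth, so the model must infer missing trajectories (e.g.
# by exploiting inertia) instead of copying its inputs.

def mask_anchors(anchors: np.ndarray, mask_ratio: float, rng: np.random.Generator):
    """anchors: (T, N, 3). Zero out a random subset of the N trajectories.
    Returns (masked_anchors, keep_mask) where keep_mask[n] is True if
    trajectory n stays visible to the model."""
    T, N, _ = anchors.shape
    keep = rng.random(N) >= mask_ratio
    masked = anchors * keep[None, :, None]  # masked trajectories set to zero
    return masked, keep

def recovery_loss(pred_depth: np.ndarray, full_depth: np.ndarray) -> float:
    """Assumed reconstruction objective: MSE against the full (unmasked)
    dynamic depth target."""
    return float(np.mean((pred_depth - full_depth) ** 2))

rng = np.random.default_rng(42)
anchors = rng.normal(size=(8, 16, 3))              # 16 trajectories, 8 frames
masked, keep = mask_anchors(anchors, 0.5, rng)     # roughly half hidden
```

Because the loss is computed on the complete depth while only a subset of anchors is visible, gradient signal flows precisely through the regions the model must extrapolate.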