Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

arXiv cs.CV / 5/6/2026

Key Points

  • The paper frames video generation models as potential “world simulators” capable of modeling physical dynamics and long-horizon causal relationships, but highlights a critical gap between that theoretical capacity and the heavy computational cost of spatiotemporal modeling.
  • It systematically reviews video generation frameworks and techniques that treat efficiency as a core requirement for practical world modeling.
  • The authors propose a new taxonomy along three dimensions: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms.
  • They argue that closing the efficiency gap enables interactive applications such as autonomous driving, embodied AI, and game simulation, and they identify emerging research directions in efficient video-based world modeling.
  • The central claim is that efficiency is fundamental for evolving video generators into general-purpose world simulators suitable for interactive and real-world deployment.

Abstract

The rapid evolution of video generation has enabled models to simulate complex physical dynamics and long-horizon causalities, positioning them as potential world simulators. However, a critical gap still remains between the theoretical capacity for world simulation and the heavy computational costs of spatiotemporal modeling. To address this, we comprehensively and systematically review video generation frameworks and techniques that consider efficiency as a crucial requirement for practical world modeling. We introduce a novel taxonomy in three dimensions: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. We further show that bridging this efficiency gap directly empowers interactive applications such as autonomous driving, embodied AI, and game simulation. Finally, we identify emerging research frontiers in efficient video-based world modeling, arguing that efficiency is a fundamental prerequisite for evolving video generators into general-purpose, real-time, and robust world simulators.