Evolution of Video Generative Foundations

arXiv cs.CV / 4/9/2026


Key Points

  • The article summarizes recent progress in AIGC video generation, highlighting both proprietary systems (e.g., Sora, Veo3, Seedance) and open-source models (e.g., Wan, HunyuanVideo) that improve temporal coherence and semantic richness.
  • It identifies gaps in existing reviews, which are often limited to specific model families such as GANs or diffusion models, or to narrower tasks such as video editing, and argues instead for a comprehensive view of the field's historical evolution.
  • The survey traces video generation advances from early GAN-based approaches to diffusion models, and then to emerging auto-regressive (AR) and multimodal techniques (a toy sketch contrasting these paradigms follows this list).
  • It analyzes foundational principles and compares strengths and limitations across approaches, with special focus on multimodal integration to boost contextual awareness.
  • The paper links these developments to broader “world model” directions and potential applications such as VR/AR, education, autonomous-driving simulation, and digital entertainment.
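
To make the paradigm shift in these points concrete, below is a minimal toy sketch (not from the paper) contrasting the two dominant generation styles the survey traces: diffusion models refine an entire noisy clip over many denoising steps, while auto-regressive models emit frames sequentially, each conditioned on its predecessors. The tiny `denoiser` and `ar_model` modules are hypothetical stand-ins for real video backbones (U-Nets/DiTs and transformer decoders).

```python
# Toy contrast of diffusion vs. auto-regressive video generation.
# Everything here is an illustrative placeholder, not the survey's method.
import torch
import torch.nn as nn

# Toy latent video: (batch, channels, frames, height, width)
B, C, T, H, W = 1, 4, 8, 16, 16

# --- Diffusion-style generation: iteratively denoise the whole clip ---
denoiser = nn.Conv3d(C, C, kernel_size=3, padding=1)  # stand-in for a video U-Net/DiT

x = torch.randn(B, C, T, H, W)        # start from pure Gaussian noise
with torch.no_grad():
    for step in range(50):            # fixed number of denoising steps
        eps_hat = denoiser(x)         # predict the noise component
        x = x - 0.02 * eps_hat        # crude update toward the data manifold

# --- Auto-regressive generation: predict frames one at a time ---
ar_model = nn.GRU(input_size=C * H * W, hidden_size=C * H * W, batch_first=True)

frames = [torch.zeros(B, 1, C * H * W)]   # begin-of-video placeholder
hidden = None
with torch.no_grad():
    for t in range(T):
        out, hidden = ar_model(frames[-1], hidden)  # condition on the previous frame
        frames.append(out)            # each new frame depends on all prior frames

video_ar = torch.cat(frames[1:], dim=1).view(B, T, C, H, W)
print(x.shape, video_ar.shape)        # both yield an 8-frame toy clip
```

The structural difference is visible in the loops: the diffusion loop iterates over noise levels for the whole clip at once, while the AR loop iterates over time, which is one reason AR models extend naturally to streaming and token-based multimodal interfaces.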

Abstract

The rapid advancement of Artificial Intelligence Generated Content (AIGC) has revolutionized video generation, enabling systems, from proprietary pioneers like OpenAI's Sora, Google's Veo3, and Bytedance's Seedance to powerful open-source contenders like Wan and HunyuanVideo, to synthesize temporally coherent and semantically rich videos. These advancements pave the way for building "world models" that simulate real-world dynamics, with applications spanning entertainment, education, and virtual reality. However, existing reviews of video generation often focus on narrow technical fields, e.g., Generative Adversarial Networks (GANs) and diffusion models, or on specific tasks (e.g., video editing), lacking a comprehensive perspective on the field's evolution, especially regarding Auto-Regressive (AR) models and the integration of multimodal information. To address these gaps, this survey first provides a systematic review of the development of video generation technology, tracing its evolution from early GANs to dominant diffusion models, and further to emerging AR-based and multimodal techniques. We conduct an in-depth analysis of the foundational principles, key advancements, and comparative strengths and limitations of each approach. Then, we explore emerging trends in multimodal video generation, emphasizing the integration of diverse data types to enhance contextual awareness. Finally, by bridging historical developments and contemporary innovations, this survey offers insights to guide future research in this rapidly evolving field, both in video generation itself and in its applications, including virtual/augmented reality, personalized education, autonomous driving simulations, digital entertainment, and advanced world models. For more details, please refer to the project at https://github.com/sjtuplayer/Awesome-Video-Foundations.
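
On the multimodal front the abstract highlights, most text-to-video systems inject prompt semantics by letting video latents cross-attend to text embeddings. The sketch below is a hypothetical minimal illustration of that mechanism; the shapes and the CLIP/T5-style prompt tokens are assumptions for illustration, not details taken from the survey.

```python
# Minimal sketch of multimodal conditioning via cross-attention:
# video tokens (queries) attend to text-prompt embeddings (keys/values).
import torch
import torch.nn as nn

d_model = 64
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

video_tokens = torch.randn(1, 8 * 16, d_model)   # flattened spatio-temporal latents
text_tokens = torch.randn(1, 12, d_model)        # e.g., CLIP/T5 prompt embeddings

# Each video token queries the prompt; the output mixes textual context
# into every spatio-temporal position, which is what "contextual awareness"
# amounts to mechanically in these models.
conditioned, _ = cross_attn(query=video_tokens, key=text_tokens, value=text_tokens)
print(conditioned.shape)  # torch.Size([1, 128, 64])
```

The same interface generalizes beyond text: audio, depth, or pose embeddings can be appended to the key/value sequence, which is one reason cross-attention has become a common route for the multimodal integration the survey emphasizes.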