Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

arXiv cs.CV / 4/13/2026


Key Points

  • Matrix-Game 3.0 is presented as a memory-augmented interactive world model aimed at real-time, long-form video generation at 720p while preserving long-horizon spatiotemporal consistency.
  • The work improves training data generation and scaling by combining Unreal Engine synthetic data, automated collection from AAA games, and real-world video augmentation to build large-scale Video-Pose-Action-Prompt quadruplet datasets.
  • It introduces a long-horizon consistency training method that models prediction residuals and uses self-correction via re-injection of imperfect generated frames, supported by camera-aware memory retrieval and injection.
  • For real-time deployment, the model uses a multi-segment autoregressive distillation approach (Distribution Matching Distillation), along with quantization and VAE decoder pruning to reduce inference cost.
  • Experiments report up to 40 FPS at 720p using a 5B model with stable minute-long memory consistency, and scaling to 2×14B improves quality, dynamics, and generalization.
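The self-correction idea in the third point resembles scheduled-sampling-style training: instead of always conditioning on ground-truth frames, the model's own imperfect predictions are sometimes re-injected as context, so it learns to recover from its accumulated error. The toy sketch below illustrates only this training-loop pattern; all names are invented and the scalar "frames" stand in for real video latents and residual losses.

```python
import random

# Hypothetical sketch of re-injection training (not the paper's code).
# With probability `reinject_prob`, the next prediction is conditioned on
# the model's previous (imperfect) output instead of the ground truth,
# exposing the model to its own drift so it can learn to correct it.

def train_step(model_predict, frames, reinject_prob=0.5):
    """Roll through a clip of scalar 'frames' and average the residual error.

    model_predict: callable mapping a context frame to the next frame.
    frames: ground-truth sequence; frames[0] seeds the context.
    """
    context = frames[0]
    total_residual = 0.0
    for target in frames[1:]:
        pred = model_predict(context)
        # Residual objective: penalize the correction (target - pred).
        total_residual += abs(target - pred)
        # Self-correction: sometimes continue from the imperfect prediction.
        context = pred if random.random() < reinject_prob else target
    return total_residual / (len(frames) - 1)
```

With a slightly biased toy model (`lambda c: c + 0.9` against a ground-truth step of 1), always re-injecting (`reinject_prob=1.0`) makes the per-step error grow (0.1, 0.2, 0.3, ...), while pure teacher forcing (`reinject_prob=0.0`) keeps it constant at 0.1 — exactly the drift that re-injection training teaches the model to counteract.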

Abstract

With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time long-form video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long-horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2×14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
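The "camera-aware memory retrieval and injection" mentioned in the abstract can be pictured as a memory bank keyed by camera pose: when generating the current frame, the frames stored under the most similar viewpoints are retrieved and injected as extra conditioning. The sketch below is a minimal illustration under that assumption; the class, the 2D-pose format, and the distance weighting are all invented for exposition and are not the paper's actual mechanism.

```python
import math

# Hypothetical sketch: a pose-keyed memory for viewpoint-aware retrieval.
# Frames are stored with their camera pose; retrieval ranks entries by a
# combined position + heading distance to the query camera.

class CameraMemory:
    def __init__(self, max_entries=256):
        self.entries = []           # list of (pose, frame) pairs
        self.max_entries = max_entries

    def add(self, pose, frame):
        """pose: (x, y, yaw_radians); frame: opaque latent/feature blob."""
        self.entries.append((pose, frame))
        if len(self.entries) > self.max_entries:
            self.entries.pop(0)     # evict the oldest entry

    @staticmethod
    def _distance(p, q, yaw_weight=1.0):
        # Mix positional distance with wrapped angular difference so frames
        # seen from a similar place AND direction rank highest.
        dx, dy = p[0] - q[0], p[1] - q[1]
        dyaw = abs(math.atan2(math.sin(p[2] - q[2]), math.cos(p[2] - q[2])))
        return math.hypot(dx, dy) + yaw_weight * dyaw

    def retrieve(self, pose, k=4):
        """Return the k stored frames whose poses are nearest the query."""
        ranked = sorted(self.entries, key=lambda e: self._distance(e[0], pose))
        return [frame for _, frame in ranked[:k]]
```

The retrieved frames would then be fed to the generator alongside the usual context, which is what lets a model "remember" a location when the camera returns to it after a long excursion.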