INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

arXiv cs.CV / 4/9/2026

Key Points

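  • INSPATIO-WORLD is a new framework that recovers and generates a high-fidelity, spatially consistent 4D (spatiotemporal) environment from a single reference video in real time, allowing users to navigate it.
  • In its Spatiotemporal Autoregressive (STAR) architecture, the Implicit Spatiotemporal Cache maintains global consistency during long-horizon navigation, while the Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into physically plausible camera trajectories.
  • It introduces Joint Distribution Matching Distillation (JDMD), which uses real-world data distributions as a regularizing guide to curb the fidelity degradation that tends to result from over-reliance on synthetic data.
  • In experiments, it is reported to significantly outperform existing SOTA models in spatial consistency and interaction precision, and to rank first among real-time interactive methods on the WorldScore-Dynamic benchmark.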

Abstract

Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: the Implicit Spatiotemporal Cache, which aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; and the Explicit Spatial Constraint Module, which enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.
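
To make the autoregressive loop described in the abstract more concrete, the following toy sketch shows the general shape of such a rollout: a cache holds reference and recent observations, user commands are turned into camera poses, and each new frame is conditioned on both. This is not the paper's implementation. Every name here (`CameraPose`, `apply_command`, `RollingCache`, `toy_generator`, `rollout`) is hypothetical, latents are plain floats, the camera moves in a 2D plane, and the "generator" is a placeholder arithmetic function standing in for a learned model.

```python
import math
from collections import deque
from dataclasses import dataclass


@dataclass
class CameraPose:
    """Planar camera pose (toy 2D stand-in for a full 6-DoF pose)."""
    x: float = 0.0
    z: float = 0.0
    yaw: float = 0.0  # radians; 0 means facing along +z


def apply_command(pose, command, step=0.1, turn=math.radians(5.0)):
    """Map a user command to the next pose (hypothetical control scheme)."""
    if command == "forward":
        return CameraPose(pose.x + step * math.sin(pose.yaw),
                          pose.z + step * math.cos(pose.yaw),
                          pose.yaw)
    if command == "left":
        return CameraPose(pose.x, pose.z, pose.yaw + turn)
    if command == "right":
        return CameraPose(pose.x, pose.z, pose.yaw - turn)
    raise ValueError(f"unknown command: {command!r}")


class RollingCache:
    """Keeps reference latents permanently plus a bounded window of history."""

    def __init__(self, reference, max_history=4):
        self.reference = list(reference)
        self.history = deque(maxlen=max_history)  # oldest entries drop out

    def context(self):
        """Return reference latents followed by the recent history."""
        return self.reference + list(self.history)

    def push(self, latent):
        self.history.append(latent)


def toy_generator(context, pose):
    """Placeholder for a learned model: mean of context, offset by pose."""
    return sum(context) / len(context) + 0.01 * pose.z


def rollout(reference, commands, max_history=4):
    """Autoregressive loop: each frame depends on the cache and current pose."""
    cache = RollingCache(reference, max_history)
    pose = CameraPose()
    frames = []
    for command in commands:
        pose = apply_command(pose, command)           # explicit camera control
        latent = toy_generator(cache.context(), pose)  # condition on cache
        cache.push(latent)                             # feed output back in
        frames.append((pose, latent))
    return frames


if __name__ == "__main__":
    for pose, latent in rollout([1.0, 1.0], ["forward", "left", "forward"]):
        print(pose, latent)
```

The one design point this sketch tries to capture is that the reference observations stay in the cache for the whole rollout while history is bounded, so per-step cost stays constant as navigation continues. How the actual system aggregates, compresses, or attends over its cache is not specified in the abstract.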