AI Navigate

InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

arXiv cs.CV / 3/13/2026

Key Points

  • InSpatio-WorldFM is introduced as an open-source real-time frame model for spatial intelligence.
  • It uses a frame-based paradigm in which each frame is generated independently, achieving low-latency real-time inference and avoiding the sequential, window-level processing common in video-based world models.
  • The approach enforces multi-view spatial consistency using explicit 3D anchors and implicit spatial memory to preserve global scene geometry while keeping fine details across viewpoints.
  • A progressive three-stage training pipeline converts a pretrained image diffusion model into a controllable frame model and then into a real-time generator via few-step distillation.
  • Experimental results show strong multi-view consistency and the ability to support interactive exploration on consumer-grade GPUs, offering an efficient alternative to video-based world models for real-time world simulation.
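
The summary does not say how the explicit 3D anchors are built or fed to the model. As a rough illustration only, one common way to tie a generated view to known scene geometry is to project a set of 3D points into the target camera and use the resulting sparse depth map as conditioning. The sketch below assumes a pinhole camera and the convention x_cam = R · x_world + t; the function name `project_anchors` and all values are hypothetical and not taken from the paper.

```python
import numpy as np

def project_anchors(points_world, K, R, t, height, width):
    """Project Nx3 world points into a pinhole camera and return a sparse depth map.

    Assumed convention (not from the paper): x_cam = R @ x_world + t, +z forward.
    Pixels hit by no point are left at +inf.
    """
    cam = points_world @ R.T + t               # (N, 3) camera-space coordinates
    z = cam[:, 2]
    in_front = z > 1e-6                        # drop points behind the camera
    cam, z = cam[in_front], z[in_front]
    uvw = cam @ K.T                            # homogeneous pixel coordinates
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)            # keep the nearest point per pixel
    return depth

# Hypothetical toy scene: four points, identity pose, 64x64 image.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 2.0],    # lands on the image centre
                   [0.0, 0.0, 5.0],    # same pixel, farther away (occluded)
                   [0.1, 0.0, 2.0],    # slightly to the right
                   [0.0, 0.0, -1.0]])  # behind the camera (discarded)
depth = project_anchors(points, K, np.eye(3), np.zeros(3), 64, 64)
```

Because the projection depends only on the shared point set and the requested pose, every view can be conditioned without reference to previously generated frames, which is consistent with the frame-independent generation described above.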

Abstract

We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.
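
The latency argument can be made concrete with a back-of-the-envelope model. The sketch below is an assumption-laden illustration, not a measurement: the `Generator` class and every number in it are hypothetical. It treats one generation call as `denoise_steps` network evaluations and notes that a new camera pose can only affect frames from a call that starts after the pose arrives, so the time for one call is a lower bound on response latency.

```python
from dataclasses import dataclass

@dataclass
class Generator:
    frames_per_call: int   # frames produced by one generation call
    denoise_steps: int     # sampler steps per call (few-step distillation lowers this)
    ms_per_step: float     # hypothetical cost of one network evaluation for the call

    def call_ms(self) -> float:
        """Time for one call; a lower bound on input-to-frame latency."""
        return self.denoise_steps * self.ms_per_step

    def fps(self) -> float:
        """Sustained throughput in frames per second."""
        return 1000.0 * self.frames_per_call / self.call_ms()

# Hypothetical numbers chosen so both generators have the same throughput.
window_model = Generator(frames_per_call=16, denoise_steps=4, ms_per_step=200.0)
frame_model = Generator(frames_per_call=1, denoise_steps=4, ms_per_step=12.5)

for name, g in [("window", window_model), ("frame", frame_model)]:
    print(f"{name}: {g.fps():.1f} fps, at least {g.call_ms():.0f} ms to react")
```

With these made-up values both generators sustain the same frame rate, yet the window-level one needs a full 16-frame call before a new input can show up on screen. That gap in responsiveness, rather than raw throughput, is what the frame-based paradigm targets.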