Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

arXiv cs.CV / 4/1/2026


Key Points

  • Stepper is a new text-driven framework for generating immersive 3D scenes by expanding panoramic scenes step-by-step rather than using one-shot or fully autoregressive methods.
  • It introduces a multi-view 360° diffusion model that enables consistent, high-resolution panoramic expansion.
  • A geometry reconstruction pipeline is used to enforce geometric coherence and reduce failures such as structural inconsistencies.
  • The approach is trained on a newly created large-scale multi-view panorama dataset and is reported to achieve state-of-the-art fidelity and structural consistency versus prior techniques.
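The stepwise pipeline outlined in the key points can be sketched as a simple loop: at each step, a diffusion model expands the scene into a new panorama at a new viewpoint, and a geometry step fuses it into the growing 3D scene. The sketch below is purely illustrative — the function names (`expand_panorama`, `reconstruct_geometry`), the viewpoint schedule, and the panorama resolution are all hypothetical stand-ins, since the paper's actual model and interfaces are not described in this summary.

```python
import numpy as np

def expand_panorama(prior_views, viewpoint, rng):
    """Hypothetical stand-in for the multi-view 360° diffusion model.

    A real model would condition on the previously generated views and
    the new viewpoint; here we just emit a dummy equirectangular
    panorama (H x W x 3, with width = 2 * height)."""
    h, w = 256, 512
    return rng.random((h, w, 3))

def reconstruct_geometry(panorama):
    """Hypothetical stand-in for the geometry reconstruction pipeline.

    A real pipeline would estimate depth and enforce coherence with the
    global scene geometry; here we return a dummy per-pixel depth map."""
    h, w, _ = panorama.shape
    return np.ones((h, w))

def stepwise_scene_generation(num_steps=3, seed=0):
    """Expand the scene one panorama at a time, rather than one-shot
    or fully autoregressive generation."""
    rng = np.random.default_rng(seed)
    views, depths = [], []
    for step in range(num_steps):
        viewpoint = (float(step), 0.0, 0.0)  # illustrative camera path
        pano = expand_panorama(views, viewpoint, rng)
        depth = reconstruct_geometry(pano)
        views.append(pano)
        depths.append(depth)
    return views, depths

views, depths = stepwise_scene_generation(num_steps=3)
```

The point of the sketch is the control flow, not the internals: each expansion step sees the accumulated views, which is how a stepwise scheme can avoid the context drift attributed to fully autoregressive expansion.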

Abstract

The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches, thereby setting a new standard for immersive scene generation.