OccSim: Multi-kilometer Simulation with Long-horizon Occupancy World Models

arXiv cs.AI / 4/1/2026


Key Points

  • OccSim is a data-driven 3D simulator built on an occupancy world model. It generates long-horizon driving scenes without pre-recorded logs or HD maps, using only a single initial frame plus a sequence of future ego actions.
  • The system can stably produce more than 3,000 continuous frames, enabling simulation-based construction of large-scale 3D occupancy maps spanning over 4 kilometers. This is a more than 80x improvement in stable generation length over prior state-of-the-art occupancy world models.
  • OccSim combines two modules: a W-DiT-based static occupancy world model, which builds known rigid transformations explicitly into its architecture to sustain ultra-long-horizon synthesis of the static environment, and a Layout Generator, which populates the dynamic foreground with reactive agents based on the synthesized road topology.
  • Experiments suggest OccSim-generated data can pre-train 4D semantic occupancy forecasting models to reach up to 67% zero-shot performance on unseen data, outperforming a previous asset-based simulator by 11%. Scaling the OccSim dataset to 5x its size raises zero-shot performance to about 74% and widens the margin over asset-based simulators to 22.1%.
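As a rough sanity check on the headline figures, the short Python sketch below works out what they imply per frame. It treats the quoted lower bounds (3,000 frames, 4 kilometers, 80x) as exact values, which is an assumption made only for illustration, so the results are order-of-magnitude estimates rather than numbers reported by the paper. The variable names are hypothetical.

```python
# Figures quoted in the key points, taken at face value.
# All three are stated as lower bounds ("more than", "over"),
# so the derived values are rough estimates only.
frames = 3_000         # stably generated frames
distance_m = 4_000.0   # extent of the generated occupancy map, in metres
improvement = 80       # claimed factor over prior occupancy world models

# Average ego travel between consecutive generated frames.
metres_per_frame = distance_m / frames

# Stable horizon of prior models implied by the 80x claim.
implied_prior_frames = frames / improvement

print(f"{metres_per_frame:.2f} m per frame")          # 1.33 m per frame
print(f"{implied_prior_frames:.1f} frames (prior)")   # 37.5 frames (prior)
```

Under these assumptions, each generated frame advances the ego vehicle by a little over a metre, and the implied stable horizon of earlier occupancy world models is a few dozen frames.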

Abstract

Data-driven autonomous driving simulation has long been constrained by its heavy reliance on pre-recorded driving logs or spatial priors, such as HD maps. This fundamental dependency severely limits scalability, restricting open-ended generation capabilities to the finite scale of existing collected datasets. To break this bottleneck, we present OccSim, the first occupancy world model-driven 3D simulator. OccSim obviates the requirement for continuous logs or HD maps; conditioned only on a single initial frame and a sequence of future ego-actions, it can stably generate over 3,000 continuous frames, enabling the continuous construction of large-scale 3D occupancy maps spanning over 4 kilometers for simulation. This represents a more than 80x improvement in stable generation length over previous state-of-the-art occupancy world models. OccSim is powered by two modules: a W-DiT-based static occupancy world model and a Layout Generator. W-DiT handles the ultra-long-horizon generation of static environments by explicitly introducing known rigid transformations in its architecture design, while the Layout Generator populates the dynamic foreground with reactive agents based on the synthesized road topology. With these designs, OccSim can synthesize massive, diverse simulation streams. Extensive experiments demonstrate its downstream utility: data collected directly from OccSim can pre-train 4D semantic occupancy forecasting models to achieve up to 67% zero-shot performance on unseen data, outperforming a previous asset-based simulator by 11%. When scaling the OccSim dataset to 5x the size, the zero-shot performance increases to about 74%, while the improvement over asset-based simulators expands to 22.1%.
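To make the phrase "known rigid transformations" more concrete, here is a minimal geometric sketch in Python with NumPy. It is not the W-DiT architecture, which the abstract does not detail. It only illustrates, under simplifying assumptions, how a sequence of ego actions determines where static occupied cells seen in each ego frame land in one shared world map. The 2D (SE(2)) simplification, the action format (forward metres, yaw change) and the names `se2`, `accumulate_poses` and `stitch` are all hypothetical.

```python
import numpy as np


def se2(x, y, yaw):
    """Homogeneous 3x3 matrix for a planar rigid transform."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]])


def accumulate_poses(actions):
    """Compose per-step ego motions into ego-to-world poses.

    Each action is (forward_m, dyaw_rad), an assumed format.
    Returns len(actions) + 1 poses, starting at the identity.
    """
    pose = np.eye(3)
    poses = [pose]
    for forward, dyaw in actions:
        pose = pose @ se2(forward, 0.0, dyaw)
        poses.append(pose)
    return poses


def stitch(local_points_per_frame, poses, voxel_size=0.5):
    """Map occupied cell centres from each ego frame into a sparse world grid."""
    world_cells = set()
    for pts, pose in zip(local_points_per_frame, poses):
        pts = np.asarray(pts, dtype=float)
        homo = np.hstack([pts, np.ones((len(pts), 1))])
        world = (pose @ homo.T).T[:, :2]
        idx = np.floor(world / voxel_size).astype(int)
        world_cells.update(tuple(cell) for cell in idx.tolist())
    return world_cells


# Usage: drive straight for three 1 m steps; each frame observes one
# occupied point 2 m ahead of the ego vehicle.
actions = [(1.0, 0.0)] * 3
poses = accumulate_poses(actions)
frames = [[(2.0, 0.0)] for _ in range(3)]
cells = stitch(frames, poses)
print(sorted(cells))  # [(4, 0), (6, 0), (8, 0)]
```

Because the ego motion is known, the static scene does not need to be re-estimated from scratch at each step: the same world cell is simply viewed from a new pose. The sketch shows only that geometric relationship, not how the model exploits it.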