DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving

arXiv cs.CV / 4/2/2026


Key Points

  • The paper proposes DLWM (Dual Latent World Models), a two-stage training paradigm aimed at holistic Gaussian-centric pre-training for vision-based autonomous driving.
  • In stage one, DLWM learns to predict 3D semantic Gaussians from queries by self-supervised reconstruction of multi-view semantic and depth images to obtain fine-grained contextual features.
  • In stage two, it trains two separate latent world models for temporal feature learning: one using Gaussian-flow-guided latent prediction for occupancy perception and 4D occupancy forecasting, and another using ego-planning-guided latent prediction for motion planning.
  • Experiments on the SurroundOcc and nuScenes benchmarks show significant performance gains across Gaussian-centric 3D occupancy perception, 4D occupancy forecasting, and motion planning tasks.
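The two-stage pipeline in the points above can be sketched as data flow. This is a minimal illustrative sketch, not the authors' implementation: all class names, method signatures, and numeric placeholders (`GaussianEncoder`, `LatentWorldModel`, the toy additive transition) are assumptions for illustration only.

```python
# Conceptual sketch of DLWM's two-stage paradigm (hypothetical code).

class GaussianEncoder:
    """Stage 1: predict 3D semantic Gaussians from queries; trained by
    self-supervised reconstruction of multi-view semantic/depth images."""

    def encode(self, num_queries):
        # Each Gaussian: (mean xyz, scale, semantic label) -- toy values.
        return [((0.0, 0.0, 0.0), 1.0, "road") for _ in range(num_queries)]


class LatentWorldModel:
    """Stage 2: predict the next latent state from the current one,
    conditioned on either Gaussian flow or the ego plan."""

    def __init__(self, guidance):
        self.guidance = guidance  # "gaussian_flow" or "ego_planning"

    def predict(self, latent, condition):
        # Toy additive update standing in for a learned latent transition.
        return [x + 0.1 * c for x, c in zip(latent, condition)]


# Stage 1: obtain fine-grained contextual features via Gaussian prediction.
encoder = GaussianEncoder()
gaussians = encoder.encode(num_queries=4)

# Stage 2: two world models trained separately on those features.
occ_wm = LatentWorldModel("gaussian_flow")    # occupancy + 4D forecasting
plan_wm = LatentWorldModel("ego_planning")    # motion planning

latent = [1.0, 2.0]
next_occ = occ_wm.predict(latent, condition=[0.5, 0.5])
next_plan = plan_wm.predict(latent, condition=[1.0, -1.0])
```

The key design choice this sketch mirrors is the separation of concerns in stage two: one latent world model is conditioned on Gaussian flow for perception-style tasks, while a second, independently trained model is conditioned on the ego plan for motion planning.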

Abstract

Vision-based autonomous driving has gained much attention due to its low cost and strong performance. Compared with dense BEV (Bird's Eye View) or sparse query models, Gaussian-centric methods offer a comprehensive yet sparse representation by describing the scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models designed to enable holistic Gaussian-centric pre-training in autonomous driving over two stages. In the first stage, DLWM predicts 3D Gaussians from queries by self-supervised reconstruction of multi-view semantic and depth images. Equipped with these fine-grained contextual features, in the second stage two latent world models are trained separately for temporal feature learning: Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting tasks, and ego-planning-guided latent prediction for motion planning. Extensive experiments on the SurroundOcc and nuScenes benchmarks demonstrate that DLWM yields significant performance gains across Gaussian-centric 3D occupancy perception, 4D occupancy forecasting, and motion planning tasks.