Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

arXiv cs.CV / 4/30/2026


Key Points

  • The paper introduces X-WAM, a unified 4D world model that combines real-time robotic action execution with high-fidelity future 4D scene synthesis (multi-view RGB-D plus 3D reconstruction), addressing the limitations of prior unified world models that operate only in 2D pixel space.
  • X-WAM leverages pretrained video diffusion transformers by predicting future multi-view RGB-D video, extracting spatial and depth information through a lightweight architectural modification: a dedicated depth prediction branch built from replicated final DiT blocks (see the sketch after this list).
  • The method proposes Asynchronous Noise Sampling (ANS), which uses a specialized asynchronous denoising schedule at inference to decode robot actions in fewer steps while reserving the full denoising schedule for higher-quality video (sketched after the abstract below).
  • Unlike fully decoupled timestep training, ANS samples from the joint timestep distribution to better match the inference-time behavior, aiming for consistent generation and action decoding.
  • On robotic benchmarks, X-WAM is pretrained on over 5,800 hours of robot data and reports strong performance (a 79.2% average success rate on RoboCasa and 90.7% on RoboTwin 2.0), while improving both visual and geometric quality metrics over existing approaches.
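
The depth branch described above amounts to weight-copying surgery on the pretrained backbone. Below is a minimal PyTorch sketch of one way such a replication could look; the `pretrained_dit` interface (`blocks`, `hidden_dim`, `patch_size`, `head`), the block count, and the head design are illustrative assumptions, not the paper's actual code.

```python
import copy

import torch
import torch.nn as nn


class DepthBranchDiT(nn.Module):
    """Hypothetical wrapper: adds a depth branch to a pretrained DiT by
    replicating its final blocks, mirroring the adaptation described above."""

    def __init__(self, pretrained_dit: nn.Module, num_replicated: int = 4):
        super().__init__()
        self.backbone = pretrained_dit
        self.split = len(pretrained_dit.blocks) - num_replicated
        # Deep-copy the last few transformer blocks so the depth branch starts
        # from the pretrained visual priors instead of random initialization.
        self.depth_blocks = nn.ModuleList(
            copy.deepcopy(b) for b in pretrained_dit.blocks[self.split:]
        )
        # Lightweight head projecting depth tokens to per-patch depth values
        # (hidden_dim and patch_size are assumed attributes of the backbone).
        self.depth_head = nn.Linear(
            pretrained_dit.hidden_dim, pretrained_dit.patch_size ** 2
        )

    def forward(self, tokens: torch.Tensor, cond: torch.Tensor):
        h = tokens
        for block in self.backbone.blocks[: self.split]:
            h = block(h, cond)          # shared trunk
        rgb = h
        for block in self.backbone.blocks[self.split:]:
            rgb = block(rgb, cond)      # original RGB path
        depth = h
        for block in self.depth_blocks:
            depth = block(depth, cond)  # replicated depth path
        return self.backbone.head(rgb), self.depth_head(depth)
```

Because only a few blocks plus a linear head are added, the adaptation keeps the extra parameter count and inference cost small relative to the full backbone.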

Abstract

We propose X-WAM, a Unified 4D World Model that unifies real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing the critical limitations of prior unified world models (e.g., UWM) that model only the 2D pixel space and fail to balance action efficiency against world-modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X-WAM imagines the future world by predicting multi-view RGB-D videos, obtaining spatial information efficiently through a lightweight structural adaptation: the final few blocks of the pretrained Diffusion Transformer are replicated into a dedicated depth prediction branch that reconstructs future spatial information. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. ANS applies a specialized asynchronous denoising schedule during inference, which decodes actions in only a few steps to enable efficient real-time execution while dedicating the full sequence of steps to generating high-fidelity video. Rather than entirely decoupling the two timesteps during training, ANS samples them from their joint distribution so that training matches the inference-time distribution. Pretrained on over 5,800 hours of robotic data, X-WAM achieves 79.2% and 90.7% average success rates on the RoboCasa and RoboTwin 2.0 benchmarks, while producing high-fidelity 4D reconstruction and generation that surpasses existing methods in both visual and geometric metrics.
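
To make the asynchronous schedule concrete, here is a hedged sketch of what ANS-style inference could look like. The `model` signature (jointly denoising video and action latents conditioned on two timesteps) and the flow-matching-style Euler update are assumptions for illustration; the paper's exact sampler may differ.

```python
import torch


@torch.no_grad()
def ans_inference(model, video, action, video_steps: int = 50, action_steps: int = 5):
    """Denoise video and action latents on asynchronous schedules: the action
    latent is clean after `action_steps`, while video runs all `video_steps`."""
    # Two timestep ladders over the same noise scale, much coarser for actions.
    t_v = torch.linspace(1.0, 0.0, video_steps + 1)
    t_a = torch.linspace(1.0, 0.0, action_steps + 1)
    for i in range(video_steps):
        j = min(i, action_steps)  # action ladder saturates at t = 0 once done
        v_pred, a_pred = model(video, action, t_v[i], t_a[j])
        # Euler update per stream, each with its own step size.
        video = video + (t_v[i + 1] - t_v[i]) * v_pred
        if i < action_steps:
            action = action + (t_a[i + 1] - t_a[i]) * a_pred
            # After the final action step, `action` could already be dispatched
            # to the robot while video refinement continues in the background.
    return video, action
```

This is what buys real-time execution: the robot only waits for `action_steps` network evaluations, not the full video schedule.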
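The abstract's point about sampling timesteps from their joint distribution, rather than fully decoupling them, can be illustrated as follows. This sketch derives training-time (t_video, t_action) pairs from the same pairing the asynchronous inference ladder induces; the paper's actual joint distribution is not specified here, so treat this construction as one plausible reading.

```python
import torch


def sample_joint_timesteps(batch: int, video_steps: int = 50, action_steps: int = 5):
    """Draw (t_video, t_action) pairs from the joint distribution induced by
    the asynchronous inference schedule, rather than sampling independently."""
    i = torch.randint(0, video_steps + 1, (batch,))
    t_video = 1.0 - i.float() / video_steps
    # Actions traverse their shorter ladder during the first `action_steps`
    # video steps, then stay clean (t = 0) for the remainder of the schedule.
    j = torch.clamp(i, max=action_steps)
    t_action = 1.0 - j.float() / action_steps
    return t_video, t_action
```

At each training step these paired timesteps would noise the video and action streams respectively before the joint denoising loss is computed, so the model only ever sees timestep combinations it will actually encounter at inference.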