DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

arXiv cs.CV / 4/21/2026

Key Points

The paper introduces the Causal Latent World Model (CLWM), a generative world-action modeling approach for robotic manipulation that disentangles interaction semantics from visual noise using DINOv3 features.
To remove the O(T) memory growth of long-horizon rollouts, CLWM uses a Dual-State Test-Time Training (TTT) Memory that keeps memory usage at a strict O(1) footprint regardless of task length.
To reduce sequential inference latency during deployment, it proposes Speculative Asynchronous Inference (SAI), which overlaps partial diffusion denoising with physical execution to cut blocking latency by about 50%.
For scaling robust embodied policies, the work presents EmbodiChain, an online training framework that injects an infinite flow of physics-grounded trajectories and claims an “Efficiency Law.”
Experiments on dual-arm simulation and real physical robots show state-of-the-art performance and unprecedented zero-shot sim-to-real transfer, outperforming methods that are explicitly fine-tuned on real-world data.
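The Dual-State TTT Memory is only summarized here, so the sketch below is a generic illustration under stated assumptions, not CLWM's implementation. It shows why a test-time-training memory has an O(1) footprint: the whole state is one fixed-size weight matrix W that takes a single gradient step on a self-supervised reconstruction loss per incoming feature, whereas an attention-style cache grows linearly with the number of steps T. The class name `LinearTTTMemory`, the random key/value/query projections, the unit-norm keys, and the learning rate are all hypothetical choices, and the paper's "dual-state" split is not modeled.

```python
import numpy as np


class LinearTTTMemory:
    """Hypothetical fixed-size memory in the spirit of test-time-training (TTT) layers.

    The entire state is one d x d matrix W. Each incoming feature vector triggers
    one gradient step on a self-supervised reconstruction loss, so memory stays
    O(1) in the number of steps T. This is NOT the paper's Dual-State design.
    """

    def __init__(self, dim: int, lr: float = 0.25, seed: int = 0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # Fixed random projections stand in for learned key/value/query maps (assumption).
        self.P_k = rng.standard_normal((dim, dim)) * scale
        self.P_v = rng.standard_normal((dim, dim)) * scale
        self.P_q = rng.standard_normal((dim, dim)) * scale
        self.W = np.zeros((dim, dim))
        self.lr = lr

    def _key(self, x: np.ndarray) -> np.ndarray:
        k = self.P_k @ x
        return k / (np.linalg.norm(k) + 1e-8)  # unit-norm keys keep the update stable

    def recon_error(self, x: np.ndarray) -> float:
        """Self-supervised loss ||W k - v||^2 for one input."""
        err = self.W @ self._key(x) - self.P_v @ x
        return float(err @ err)

    def step(self, x: np.ndarray) -> np.ndarray:
        k, v, q = self._key(x), self.P_v @ x, self.P_q @ x
        # One gradient step on L(W) = ||W k - v||^2, whose gradient is 2 (W k - v) k^T.
        err = self.W @ k - v
        self.W -= self.lr * 2.0 * np.outer(err, k)
        # Read out with the query using the updated weights.
        return self.W @ q


dim, T = 16, 1_000
rng = np.random.default_rng(1)
memory = LinearTTTMemory(dim)
kv_cache = []  # contrast: an attention-style cache grows with T
for _ in range(T):
    frame = rng.standard_normal(dim)  # stand-in for a per-frame latent feature
    memory.step(frame)
    kv_cache.append(frame)

print(memory.W.nbytes, len(kv_cache))  # 2048 1000: W stays 16*16 float64s, cache holds T entries
```

Because each update is a fixed-cost rank-one correction to W, per-step compute is also constant; the trade-off is that past frames are compressed into W rather than stored verbatim, as a cache would store them.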

Abstract

Deploying generative World-Action Models for manipulation is severely bottlenecked by redundant pixel-level reconstruction, $\mathcal{O}(T)$ memory scaling, and sequential inference latency. We introduce the Causal Latent World Model (CLWM), which employs DINOv3 features as generative targets to disentangle interaction semantics from visual noise, yielding highly robust domain generalization. To overcome memory scaling, CLWM features a Dual-State Test-Time Training (TTT) Memory that guarantees a strict $\mathcal{O}(1)$ footprint for long-horizon tasks. To overcome deployment latency, we propose Speculative Asynchronous Inference (SAI) to mask partial diffusion denoising behind physical execution, cutting blocking latency by about 50%. To scale robust policies, we present EmbodiChain, an online framework that establishes the Efficiency Law by injecting an infinite flow of physics-grounded trajectories during training. Extensive experiments validate that CLWM achieves state-of-the-art performance in complex dual-arm simulation and unprecedented zero-shot sim-to-real transfer on physical robots, outperforming baselines explicitly finetuned on real-world data.
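The "about 50%" latency figure is easiest to read through a toy timing model. The sketch below is an assumption-laden illustration, not the paper's SAI algorithm: it supposes each action chunk needs K denoising steps, that some of the next chunk's steps can run speculatively while the robot executes the current chunk, and that the remaining steps must wait until execution finishes (for example, to condition on a fresh observation). Under those assumptions, only the steps that are not hidden behind execution block the robot. The names `Timing` and `blocking_ms` and all millisecond values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    denoise_steps: int  # K: diffusion denoising steps needed per action chunk
    step_ms: float      # wall-clock cost of one denoising step
    exec_ms: float      # time the robot spends executing one chunk


def blocking_ms(t: Timing, speculative_steps: int) -> float:
    """Idle time before the next chunk is ready (toy model, hypothetical).

    Steps run speculatively during execution are hidden, but only as many as
    fit inside the execution window; the rest run after execution and block.
    """
    fits = int(t.exec_ms // t.step_ms)
    hidden = min(speculative_steps, fits, t.denoise_steps)
    return (t.denoise_steps - hidden) * t.step_ms


t = Timing(denoise_steps=10, step_ms=20.0, exec_ms=200.0)
sequential = blocking_ms(t, speculative_steps=0)  # all 10 steps block
overlapped = blocking_ms(t, speculative_steps=5)  # 5 steps hidden behind execution
print(sequential, overlapped, 1 - overlapped / sequential)  # 200.0 100.0 0.5

# A short execution window caps the saving: only 3 steps fit in 60 ms.
short = Timing(denoise_steps=10, step_ms=20.0, exec_ms=60.0)
print(blocking_ms(short, speculative_steps=5))  # 140.0
```

In this toy model, hiding half of the K steps halves the blocking time, and the capped case shows that the saving can never exceed what fits inside one execution window.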