Pixel Motion Diffusion is What We Need for Robot Control

arXiv cs.RO / 4/3/2026

Opinion | Ideas & Deep Analysis | Models & Research

Key Points

  • The paper introduces DAWN (Diffusion is All We Need for robot control), a unified, diffusion-based framework for language-conditioned robotic manipulation that connects high-level motion intent to low-level robot actions through a structured pixel motion representation.
  • It models both the high-level and low-level controllers as diffusion processes, enabling an end-to-end trainable system with interpretable intermediate motion abstractions.
  • DAWN reportedly achieves state-of-the-art performance on the CALVIN benchmark for multi-task robotic learning and also shows strong results on MetaWorld.
  • Despite the substantial domain gap between simulation and reality and limited real-world data, the authors report reliable real-world transfer with only minimal finetuning.
  • The work positions diffusion modeling combined with motion-centric visual abstractions as a scalable, robust baseline for future robot learning systems.

Abstract

We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via a structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://eronguyen.github.io/DAWN/
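
The two-level structure described in the abstract can be illustrated with a toy sketch: one diffusion sampler produces a pixel motion map from an image and a language embedding, and a second sampler produces an action sequence conditioned on that map. This is not the authors' implementation. The sampler is a generic DDPM-style loop with a linear noise schedule, the two denoisers are untrained placeholders, and every name, shape, and hyperparameter here (`ddpm_sample`, `toy_motion_denoiser`, `toy_action_denoiser`, `plan_and_act`, a dense 2-channel motion field, a 7-dimensional action, 50 steps) is an assumption made for illustration only.

```python
import numpy as np


def ddpm_sample(denoiser, shape, cond, steps=50, seed=0):
    """Generic DDPM ancestral sampling with a linear beta schedule (assumed)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in range(steps - 1, -1, -1):
        eps_hat = denoiser(x, t, cond)  # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x


def toy_motion_denoiser(x, t, cond):
    """Hypothetical stand-in for a trained high-level (pixel motion) network."""
    image, text_emb = cond
    return 0.1 * x + 0.01 * text_emb.mean()


def toy_action_denoiser(x, t, cond):
    """Hypothetical stand-in for a trained low-level (action) network."""
    image, pixel_motion = cond
    return 0.1 * x + 0.01 * pixel_motion.mean()


def plan_and_act(image, text_emb, horizon=4, action_dim=7):
    """High-level diffusion -> pixel motion -> low-level diffusion -> actions."""
    h, w, _ = image.shape
    # Stage 1: sample a dense 2-channel (dx, dy) motion field (assumed format).
    pixel_motion = ddpm_sample(toy_motion_denoiser, (h, w, 2), (image, text_emb), seed=0)
    # Stage 2: sample an action sequence conditioned on the motion field.
    actions = ddpm_sample(toy_action_denoiser, (horizon, action_dim), (image, pixel_motion), seed=1)
    return pixel_motion, actions


if __name__ == "__main__":
    image = np.zeros((8, 8, 3))
    text_emb = np.ones(16)
    motion, actions = plan_and_act(image, text_emb)
    print(motion.shape, actions.shape)
```

The point of the sketch is the data flow: the intermediate pixel motion map is an explicit array that can be inspected or visualized before it is passed to the action sampler, which is the sense in which the abstract calls the intermediate abstraction interpretable.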