DiffusionAnything: End-to-End In-context Diffusion Learning for Unified Navigation and Pre-Grasp Motion
arXiv cs.RO / 3/30/2026
Key Points
- DiffusionAnything proposes an end-to-end diffusion-based robotics policy that predicts unified navigation and pre-grasp manipulation trajectories directly from RGB images, avoiding explicit goal specification and task-specific planning pipelines (see the diffusion-sampling sketch after this list).
- The approach uses multi-scale FiLM conditioning (on task mode, depth scale, and spatial attention) plus trajectory-aligned depth prediction to support metric 3D reasoning across both meter-scale and centimeter-scale tasks in a single model (see the FiLM sketch after this list).
- A self-supervised attention mechanism drawn from AnyTraverse enables goal-directed zero-shot inference without relying on vision-language models or depth sensors.
- The method reports strong zero-shot generalization to novel scenes while requiring only about 5 minutes of self-supervised data per task and running efficiently onboard (≈2.0 GB memory, 10 Hz).
- Overall, the work positions diffusion policies as a more computationally efficient, data-efficient, and sensor-light alternative to heavyweight vision-language-action (VLA) systems for robot motion planning.
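
The paper frames the policy as a diffusion model that iteratively denoises a trajectory conditioned on RGB features. The sketch below is a generic DDPM-style reverse sampling loop of that kind, not the paper's actual implementation; the `denoiser` interface, noise schedule, horizon, and action dimension are illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample_trajectory(denoiser, obs_feat, horizon=16, action_dim=3, steps=50):
    """DDPM-style reverse diffusion over a waypoint trajectory, conditioned
    on image features. `denoiser(traj, obs_feat, t)` is assumed to predict
    the noise added to the clean trajectory at diffusion step t."""
    betas = torch.linspace(1e-4, 0.02, steps)       # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    traj = torch.randn(1, horizon, action_dim)      # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(traj, obs_feat, t)           # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (traj - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(traj) if t > 0 else torch.zeros_like(traj)
        traj = mean + torch.sqrt(betas[t]) * noise  # one reverse step
    return traj  # (1, horizon, action_dim): waypoints or pre-grasp poses

# Trivial stand-in denoiser, just to show the call pattern:
# denoiser = lambda traj, obs, t: torch.zeros_like(traj)
# waypoints = sample_trajectory(denoiser, obs_feat=None)
```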

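The FiLM conditioning mentioned above amounts to predicting per-channel scale and shift parameters from a conditioning vector (e.g. a task-mode plus depth-scale embedding) and applying them to visual feature maps. A minimal sketch under those assumptions follows; the class and argument names (`FiLMBlock`, `cond_dim`) are hypothetical, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """Feature-wise linear modulation: a conditioning vector produces
    per-channel gamma/beta applied to a (B, C, H, W) feature map."""

    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        # One linear layer maps the conditioning vector to gamma and beta.
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * channels)

    def forward(self, feat: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]   # broadcast over spatial dims
        beta = beta[:, :, None, None]
        return gamma * feat + beta

# Example usage with made-up shapes:
film = FiLMBlock(channels=64, cond_dim=8)
feat = torch.randn(2, 64, 32, 32)         # visual features from the encoder
cond = torch.randn(2, 8)                  # e.g. task mode + depth scale embedding
out = film(feat, cond)                    # (2, 64, 32, 32), modulated features
```

Applying such blocks at several encoder resolutions is one plausible reading of "multi-scale" conditioning, letting the same backbone adapt its features to meter-scale navigation or centimeter-scale pre-grasp reasoning.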
