DiffusionAnything: End-to-End In-context Diffusion Learning for Unified Navigation and Pre-Grasp Motion

arXiv cs.RO / 3/30/2026


Key Points

  • DiffusionAnything proposes an end-to-end diffusion-based robot policy that predicts both navigation and pre-grasp manipulation motions directly from RGB images, avoiding explicit goal specification and task-specific planning pipelines.
  • The approach uses multi-scale FiLM conditioning (task mode, depth scale, and spatial attention) plus trajectory-aligned depth prediction to support metric 3D reasoning across both meter-scale and centimeter-scale tasks in a single model (a minimal FiLM sketch follows this list).
  • A self-supervised attention mechanism drawn from AnyTraverse enables goal-directed zero-shot inference without relying on vision-language models or depth sensors.
  • The method reports strong zero-shot generalization to novel scenes while requiring only about 5 minutes of self-supervised data per task and running efficiently onboard (≈2.0 GB memory, 10 Hz).
  • Overall, the work positions diffusion policies as a more computationally efficient, data-efficient, and sensor-light alternative to heavy VLA systems for robot motion planning.

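To make the conditioning scheme concrete, here is a minimal PyTorch sketch of multi-scale FiLM: a conditioning vector built from the task mode, a depth-scale value, and a pooled spatial-attention embedding predicts per-channel scale and shift parameters for encoder features at several resolutions. The module names, channel widths, and the way the three signals are embedded are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of multi-scale FiLM conditioning (assumed structure, not
# the authors' code). Each feature scale gets its own FiLM block driven by a
# shared conditioning vector.
import torch
import torch.nn as nn


class FiLMBlock(nn.Module):
    """Predicts per-channel scale (gamma) and shift (beta) from a conditioning vector."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feat: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); cond: (B, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * feat + beta


class MultiScaleFiLM(nn.Module):
    """One FiLM block per encoder scale (channel widths are illustrative)."""

    def __init__(self, cond_dim: int, channels=(64, 128, 256)):
        super().__init__()
        self.blocks = nn.ModuleList(FiLMBlock(cond_dim, c) for c in channels)

    def forward(self, feats, cond):
        return [blk(f, cond) for blk, f in zip(self.blocks, feats)]


# Conditioning vector: task mode (navigation vs. pre-grasp), a scalar depth
# scale, and a pooled spatial-attention embedding, concatenated (all values
# below are placeholders).
task_mode = torch.tensor([[1.0, 0.0]])       # one-hot: navigation
depth_scale = torch.tensor([[3.5]])          # meters (illustrative)
attn_embed = torch.randn(1, 13)              # pooled attention features
cond = torch.cat([task_mode, depth_scale, attn_embed], dim=-1)  # (1, 16)

feats = [torch.randn(1, c, s, s) for c, s in [(64, 64), (128, 32), (256, 16)]]
modulated = MultiScaleFiLM(cond_dim=16)(feats, cond)
```
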
Abstract

Efficiently predicting motion plans directly from vision remains a fundamental challenge in robotics, where planning typically requires explicit goal specification and task-specific design. Recent vision-language-action (VLA) models infer actions directly from visual input but demand massive computational resources and extensive training data, and fail zero-shot in novel scenes. We present a unified image-space diffusion policy handling both meter-scale navigation and centimeter-scale manipulation via multi-scale feature modulation, with only 5 minutes of self-supervised data per task. Three key innovations drive the framework: (1) Multi-scale FiLM conditioning on task mode, depth scale, and spatial attention enables task-appropriate behavior in a single model; (2) trajectory-aligned depth prediction focuses metric 3D reasoning along generated waypoints; (3) self-supervised attention from AnyTraverse enables goal-directed inference without vision-language models or depth sensors. Operating purely from RGB input (2.0 GB memory, 10 Hz), the model achieves robust zero-shot generalization to novel scenes while remaining suitable for onboard deployment.
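
As a rough illustration of how such a policy could run at inference time, the sketch below denoises a set of image-space waypoints with a DDPM-style loop and then samples predicted depth only at those waypoints, mirroring the trajectory-aligned depth idea. The `denoiser` callable, the linear noise schedule, the waypoint count, and the bilinear depth sampling are generic assumptions rather than the authors' exact architecture.

```python
# A minimal sketch (assumed, not the paper's implementation) of inference for
# an image-space diffusion policy: ancestral DDPM sampling of 2D waypoints
# conditioned on RGB features, followed by depth queries along the trajectory.
import torch
import torch.nn.functional as F


@torch.no_grad()
def denoise_trajectory(denoiser, rgb_feats, cond, num_waypoints=16, steps=50):
    """DDPM-style ancestral sampling of image-space waypoints in [-1, 1] coords."""
    traj = torch.randn(1, num_waypoints, 2)        # start from Gaussian noise
    betas = torch.linspace(1e-4, 0.02, steps)       # linear schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for t in reversed(range(steps)):
        # denoiser is a placeholder network predicting the added noise
        eps = denoiser(traj, t, rgb_feats, cond)
        traj = (traj - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            traj = traj + betas[t].sqrt() * torch.randn_like(traj)
    return traj.clamp(-1.0, 1.0)


def depth_along_trajectory(depth_map, traj):
    """Sample predicted metric depth at the generated waypoints (bilinear)."""
    # depth_map: (1, 1, H, W); traj: (1, N, 2) in [-1, 1] image coordinates
    grid = traj.unsqueeze(2)                        # (1, N, 1, 2) for grid_sample
    return F.grid_sample(depth_map, grid, align_corners=True).squeeze()  # (N,)
```

The point of sampling depth only along the generated waypoints, rather than densely, is that metric 3D reasoning is spent exactly where the motion plan needs it, which is consistent with the paper's claim of supporting both meter-scale and centimeter-scale tasks from RGB alone.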