VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation

arXiv cs.RO / 4/20/2026


Key Points

  • Diffusion policies for robotic manipulation converge slowly in training and often time out at inference because uniform sampling ignores per-sample difficulty, producing a hard negative class imbalance.
  • The proposed VADF framework uses a vision-driven dual-adaptive design that is model-agnostic, so it can be integrated with different diffusion-policy architectures.
  • During training, VADF introduces an Adaptive Loss Network (ALN) that predicts per-step difficulty and applies hard negative mining with weighted sampling to speed up convergence.
  • During inference, VADF’s Hierarchical Vision Task Segmenter (HVTS) breaks high-level vision-guided instructions into multi-stage sub-instructions and assigns different noise schedules to simple vs. complex subtasks, cutting computation and boosting early success.
  • The authors report that VADF reduces the number of training steps to convergence and improves the early inference success rate relative to standard diffusion policies.
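The training-time idea behind the ALN can be illustrated with a small sketch. The paper does not publish code, so the MLP shape, the softmax temperature, and all variable names below are assumptions for illustration: a tiny loss predictor scores each sample's difficulty, and a batch is then drawn with probability weighted toward the predicted-hardest samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def aln_predict(features, w1, b1, w2, b2):
    # Tiny MLP loss predictor (one ReLU hidden layer, scalar output):
    # stands in for the paper's lightweight Adaptive Loss Network.
    h = np.maximum(features @ w1 + b1, 0.0)
    return (h @ w2 + b2).squeeze(-1)

def hard_negative_weights(pred_loss, temperature=1.0):
    # Softmax over predicted losses -> sampling distribution that
    # prioritizes high-loss ("hard negative") samples.
    z = (pred_loss - pred_loss.max()) / temperature
    p = np.exp(z)
    return p / p.sum()

# Toy batch: 8 samples with 4-dim features, random predictor weights.
feats = rng.normal(size=(8, 4))
w1 = rng.normal(size=(4, 16)); b1 = np.zeros(16)
w2 = rng.normal(size=(16, 1)); b2 = np.zeros(1)

pred = aln_predict(feats, w1, b1, w2, b2)
probs = hard_negative_weights(pred)
# Weighted sampling: the hardest-looking samples enter the batch first.
batch_idx = rng.choice(len(feats), size=4, replace=False, p=probs)
print(probs.round(3), batch_idx)
```

In the full method the predictor would be trained online against the diffusion policy's actual per-step losses; here the weights are random purely to show the sampling mechanics.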

Abstract

Diffusion policies are becoming mainstream in robotic manipulation but suffer from hard negative class imbalance due to uniform sampling and a lack of sample difficulty awareness, leading to slow training convergence and frequent inference timeout failures. We propose VADF (Vision-Adaptive Diffusion Policy Framework), a vision-driven dual-adaptive framework that significantly reduces convergence steps and achieves early success in inference; its model-agnostic design enables seamless integration into any diffusion policy architecture. During training, we introduce the Adaptive Loss Network (ALN), a lightweight MLP-based loss predictor that quantifies per-step sample difficulty in real time. Guided by hard negative mining, it performs weighted sampling to prioritize high-loss regions, enabling adaptive weight updates and faster convergence. At inference, we design the Hierarchical Vision Task Segmenter (HVTS), which decomposes high-level task instructions into multi-stage low-level sub-instructions based on visual input. It adaptively segments action sequences into simple and complex subtasks, assigning shorter noise schedules with longer direct execution sequences to simple actions, and longer noise schedules with shorter execution sequences to complex ones, thereby dramatically reducing computational overhead and significantly improving the early success rate.
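The HVTS allocation rule described above can be sketched as a simple lookup. The paper gives no concrete step counts, so the subtask names, the difficulty labels, and the `denoise_steps`/`exec_horizon` values below are hypothetical; the sketch only shows the trade-off: simple subtasks get few denoising steps but a long open-loop execution horizon, while complex subtasks get many denoising steps and a short horizon.

```python
def hvts_schedule(subtasks):
    """Assign a (denoise_steps, exec_horizon) budget per subtask.

    subtasks: list of (name, difficulty) pairs, difficulty in
    {"simple", "complex"}. All numeric budgets are illustrative
    assumptions, not values from the paper.
    """
    plan = []
    for name, difficulty in subtasks:
        if difficulty == "simple":
            # Few denoising iterations, execute a long action chunk directly.
            plan.append((name, {"denoise_steps": 5, "exec_horizon": 12}))
        else:
            # Many denoising iterations, replan after a short action chunk.
            plan.append((name, {"denoise_steps": 20, "exec_horizon": 4}))
    return plan

# Hypothetical decomposition of a pick-and-place instruction into stages.
stages = [("reach", "simple"), ("grasp", "complex"), ("transport", "simple")]
for name, cfg in hvts_schedule(stages):
    print(name, cfg)
```

In the actual framework the simple/complex label would come from the vision-based segmenter rather than being hand-specified, and total compute drops because only the short complex segments pay for long noise schedules.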