Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation

arXiv cs.CV / 3/13/2026

📰 News · Models & Research

Key Points

  • Ada3Drift proposes a training-time drifting field that attracts predicted actions toward expert demonstration modes while repelling them from other samples, enabling high-fidelity single-step (1 NFE) visuomotor generation from 3D point clouds.
  • It addresses multimodal action distributions in diffusion-based policies by preserving distinct modes instead of collapsing to a single averaged trajectory.
  • The method introduces a sigmoid-scheduled loss that transitions from coarse distribution learning to mode-sharpening refinement, and uses multi-scale field aggregation to capture action modes at different spatial scales.
  • It achieves state-of-the-art performance on Adroit, Meta-World, and RoboTwin benchmarks and real-world tasks, while using about 10x fewer function evaluations than diffusion-based approaches.
  • This work advances real-time robotic manipulation through efficient one-step generation, potentially enabling faster control pipelines.
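The summary does not spell out how the training-time drifting field is computed, but the attract-repel idea can be illustrated with a toy objective. The sketch below is an assumption-laden stand-in, not the paper's method: `drift_loss` and the weight `lam` are hypothetical names, and a simple nearest-mode attraction plus RBF-kernel repulsion stands in for whatever field Ada3Drift actually learns.

```python
import numpy as np

def drift_loss(pred, experts, lam=0.1):
    """Toy attract-repel objective (illustrative only, not Ada3Drift's actual field):
    pull each predicted action toward its nearest expert demonstration mode,
    while pushing distinct predicted samples apart so they do not collapse."""
    # attraction: squared distance from each prediction to its nearest expert mode
    d = np.linalg.norm(pred[:, None, :] - experts[None, :, :], axis=-1)  # (B, E)
    attract = (d.min(axis=1) ** 2).mean()
    # repulsion: RBF kernel between distinct predictions; large when samples collapse
    pd = np.linalg.norm(pred[:, None, :] - pred[None, :, :], axis=-1)    # (B, B)
    off_diag = ~np.eye(len(pred), dtype=bool)
    repel = np.exp(-pd[off_diag] ** 2).mean()
    return attract + lam * repel
```

On a toy bimodal demonstration set, predictions sitting on the two expert modes score lower than predictions averaged between them, which is exactly the failure mode (mode averaging) the key points say Ada3Drift is designed to penalize.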

Abstract

Diffusion-based visuomotor policies effectively capture multimodal action distributions through iterative denoising, but their high inference latency limits real-time robotic control. Recent flow matching and consistency-based methods achieve single-step generation, yet sacrifice the ability to preserve distinct action modes, collapsing multimodal behaviors into averaged, often physically infeasible trajectories. We observe that the compute budget asymmetry in robotics (offline training vs. real-time inference) naturally motivates recovering this multimodal fidelity by shifting iterative refinement from inference time to training time. Building on this insight, we propose Ada3Drift, which learns a training-time drifting field that attracts predicted actions toward expert demonstration modes while repelling them from other generated samples, enabling high-fidelity single-step generation (1 NFE) from 3D point cloud observations. To handle the few-shot robotic regime, Ada3Drift further introduces a sigmoid-scheduled loss transition from coarse distribution learning to mode-sharpening refinement, and multi-scale field aggregation that captures action modes at varying spatial granularities. Experiments on three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and real-world robotic manipulation tasks demonstrate that Ada3Drift achieves state-of-the-art performance while requiring 10× fewer function evaluations than diffusion-based alternatives.
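The abstract's "sigmoid-scheduled loss transition" can be sketched as a logistic blend between two loss terms over normalized training progress. This is a minimal sketch under assumptions: the paper does not publish its schedule here, and the function names, `midpoint`, and `sharpness` parameters are hypothetical.

```python
import math

def sigmoid_weight(step, total_steps, midpoint=0.5, sharpness=10.0):
    # Logistic ramp over normalized training progress t in [0, 1]:
    # near 0 early in training, near 1 late in training.
    t = step / total_steps
    return 1.0 / (1.0 + math.exp(-sharpness * (t - midpoint)))

def scheduled_loss(coarse_loss, sharp_loss, step, total_steps):
    # Early training: the coarse distribution-learning term dominates;
    # late training: the mode-sharpening refinement term takes over.
    w = sigmoid_weight(step, total_steps)
    return (1.0 - w) * coarse_loss + w * sharp_loss
```

The smooth handoff avoids an abrupt objective switch: the policy first fits the broad action distribution, then gradually shifts weight onto sharpening individual modes.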