Drift-Based Policy Optimization: Native One-Step Policy Learning for Online Robot Control

arXiv cs.RO / 4/7/2026


Key Points

  • The paper proposes the Drift-Based Policy (DBP), which replaces multi-step generative robotic control policies with a native one-step generative backbone by training the model to internalize iterative refinement through fixed-point drifting objectives (see the inference-cost sketch after this list).
  • It introduces Drift-Based Policy Optimization (DBPO), an online RL method that adds a compatible stochastic interface to the pretrained DBP backbone, enabling stable on-policy updates while keeping one-step, low-inference-cost deployment.
  • Experiments across offline imitation learning, online fine-tuning, and real-world control show DBP matches or exceeds multi-step diffusion policy performance while delivering up to 100× faster inference.
  • On challenging manipulation benchmarks, DBP also outperforms existing one-step policy baselines, and DBPO enables reliable, stable online policy improvement.
  • A real-world dual-arm robot experiment reports reliable high-frequency closed-loop control at 105.2 Hz, demonstrating the practical feasibility of the one-step approach for online robot control.
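
To make the inference-cost contrast behind these results concrete, the sketch below compares a diffusion-style policy that iteratively denoises an action over many network function evaluations (NFEs) with a one-step backbone that emits the action in a single forward pass. It is a minimal PyTorch illustration with assumed dimensions and a placeholder refinement rule, not the paper's DBP architecture or its fixed-point drifting objective.

```python
# Minimal sketch (PyTorch): multi-step denoising inference vs. one-step inference.
# Network sizes, the number of steps, and the refinement update are illustrative
# placeholders, not the models or objectives used in the paper.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 32, 7

class DenoiserPolicy(nn.Module):
    """Multi-step policy: refines a noisy action conditioned on the observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM + 1, 256),
                                 nn.ReLU(), nn.Linear(256, ACT_DIM))

    def forward(self, obs, noisy_action, t):
        return self.net(torch.cat([obs, noisy_action, t], dim=-1))

@torch.no_grad()
def multi_step_action(policy, obs, num_steps=100):
    """Iterative refinement: num_steps network function evaluations (NFEs) per action."""
    action = torch.randn(obs.shape[0], ACT_DIM)
    for k in reversed(range(num_steps)):
        t = torch.full((obs.shape[0], 1), k / num_steps)
        # Placeholder update toward the predicted clean action
        # (stands in for a proper denoising sampler).
        action = action + 0.1 * (policy(obs, action, t) - action)
    return action

class OneStepPolicy(nn.Module):
    """One-step backbone: a single NFE maps (observation, noise) to an action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 256),
                                 nn.ReLU(), nn.Linear(256, ACT_DIM))

    def forward(self, obs, noise):
        return self.net(torch.cat([obs, noise], dim=-1))

obs = torch.randn(1, OBS_DIM)
a_multi = multi_step_action(DenoiserPolicy(), obs)        # ~100 NFEs per action
a_one = OneStepPolicy()(obs, torch.randn(1, ACT_DIM))     # 1 NFE per action
```

Because the one-step policy spends exactly one NFE per action, its per-step latency is roughly the cost of a single forward pass, which is what makes closed-loop control rates on the order of 100 Hz practical on real hardware.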

Abstract

Although multi-step generative policies achieve strong performance in robotic manipulation by modeling multimodal action distributions, they require multi-step iterative denoising at inference time. Each action therefore needs tens to hundreds of network function evaluations (NFEs), making them costly for high-frequency closed-loop control and online reinforcement learning (RL). To address this limitation, we propose a two-stage framework for native one-step generative policies that shifts refinement from inference to training. First, we introduce the Drift-Based Policy (DBP), which leverages fixed-point drifting objectives to internalize iterative refinement into the model parameters, yielding a one-step generative backbone by design while preserving multimodal action modeling capacity. Second, we develop Drift-Based Policy Optimization (DBPO), an online RL framework that equips the pretrained backbone with a compatible stochastic interface, enabling stable on-policy updates without sacrificing the one-step deployment property. Extensive experiments demonstrate the effectiveness of the proposed framework across offline imitation learning, online fine-tuning, and real-world control scenarios. DBP matches or exceeds the performance of multi-step diffusion policies while achieving up to 100× faster inference. It also consistently outperforms existing one-step baselines on challenging manipulation benchmarks. Moreover, DBPO enables effective and stable policy improvement in online settings. Experiments on a real-world dual-arm robot demonstrate reliable high-frequency control at 105.2 Hz.
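
As a rough illustration of how a pretrained one-step backbone can be exposed to on-policy RL, the sketch below wraps its output in a learned Gaussian so that log-probabilities, and therefore a clipped policy-gradient surrogate, can be computed, while deployment can still take the mean in a single forward pass. The Gaussian head, the PPO-style loss, and all names and hyperparameters here are assumptions for this summary, not the paper's actual DBPO interface or objective.

```python
# Minimal sketch (PyTorch) of a "stochastic interface" over a pretrained one-step
# backbone: the backbone output is treated as the mean of a Gaussian so that
# log-probabilities, and hence an on-policy clipped surrogate loss, can be computed.
# Shapes, the Gaussian head, and the loss are illustrative assumptions only.
import torch
import torch.nn as nn
from torch.distributions import Normal

class OneStepBackbone(nn.Module):
    """Stand-in for a pretrained one-step generator mapping (obs, noise) -> action."""
    def __init__(self, obs_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 256),
                                 nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, obs, noise):
        return self.net(torch.cat([obs, noise], dim=-1))

class StochasticInterface(nn.Module):
    """Wraps the one-step backbone with a learned Gaussian for on-policy updates."""
    def __init__(self, backbone, act_dim=7):
        super().__init__()
        self.backbone = backbone
        self.log_std = nn.Parameter(torch.full((act_dim,), -1.0))  # exploration scale

    def dist(self, obs, noise):
        mean = self.backbone(obs, noise)          # still a single NFE
        return Normal(mean, self.log_std.exp())

    @torch.no_grad()
    def act(self, obs, noise, deterministic=False):
        d = self.dist(obs, noise)
        return d.mean if deterministic else d.sample()

def clipped_surrogate(interface, obs, noise, actions, old_log_probs, advantages, clip=0.2):
    """PPO-style clipped objective over actions sampled from the stochastic interface."""
    log_probs = interface.dist(obs, noise).log_prob(actions).sum(-1)
    ratio = (log_probs - old_log_probs).exp()
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Deployment can remain deterministic and one-step, e.g.:
# action = StochasticInterface(OneStepBackbone()).act(obs, noise, deterministic=True)
```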