Drift-Based Policy Optimization: Native One-Step Policy Learning for Online Robot Control

arXiv cs.RO / 4/7/2026


Key Points

  • The paper proposes the Drift-Based Policy (DBP), which replaces multi-step generative robotic control policies with a native one-step generative backbone by training the model to internalize iterative refinement through fixed-point drifting objectives (see the inference-cost sketch after this list).
  • It introduces Drift-Based Policy Optimization (DBPO), an online RL method that adds a compatible stochastic interface to the pretrained DBP backbone, enabling stable on-policy updates while keeping one-step, low-inference-cost deployment.
  • Experiments across offline imitation learning, online fine-tuning, and real-world control show DBP matches or exceeds multi-step diffusion policy performance while delivering up to 100× faster inference.
  • On challenging manipulation benchmarks, DBP also outperforms existing one-step policy baselines, and DBPO enables reliable, stable online policy improvement.
  • A real-world dual-arm robot experiment reports reliable high-frequency closed-loop control at 105.2 Hz, demonstrating the practical feasibility of the one-step approach for online robot control.
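
To make the inference-cost contrast behind these results concrete, the sketch below compares a diffusion-style policy that iteratively denoises an action over many network function evaluations (NFEs) with a one-step backbone that emits the action in a single forward pass. It is a minimal PyTorch illustration with assumed dimensions and a placeholder refinement rule, not the paper's DBP architecture or its fixed-point drifting objective.

```python
# Minimal sketch (PyTorch): multi-step denoising inference vs. one-step inference.
# Network sizes, the number of steps, and the refinement update are illustrative
# placeholders, not the models or objectives used in the paper.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 32, 7

class DenoiserPolicy(nn.Module):
    """Multi-step policy: refines a noisy action conditioned on the observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM + 1, 256),
                                 nn.ReLU(), nn.Linear(256, ACT_DIM))

    def forward(self, obs, noisy_action, t):
        return self.net(torch.cat([obs, noisy_action, t], dim=-1))

@torch.no_grad()
def multi_step_action(policy, obs, num_steps=100):
    """Iterative refinement: num_steps network function evaluations (NFEs) per action."""
    action = torch.randn(obs.shape[0], ACT_DIM)
    for k in reversed(range(num_steps)):
        t = torch.full((obs.shape[0], 1), k / num_steps)
        # Placeholder update toward the predicted clean action
        # (stands in for a proper denoising sampler).
        action = action + 0.1 * (policy(obs, action, t) - action)
    return action

class OneStepPolicy(nn.Module):
    """One-step backbone: a single NFE maps (observation, noise) to an action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 256),
                                 nn.ReLU(), nn.Linear(256, ACT_DIM))

    def forward(self, obs, noise):
        return self.net(torch.cat([obs, noise], dim=-1))

obs = torch.randn(1, OBS_DIM)
a_multi = multi_step_action(DenoiserPolicy(), obs)        # ~100 NFEs per action
a_one = OneStepPolicy()(obs, torch.randn(1, ACT_DIM))     # 1 NFE per action
```

Because the one-step policy spends exactly one NFE per action, its per-step latency is roughly the cost of a single forward pass, which is what makes closed-loop control rates on the order of 100 Hz practical on real hardware.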

Abstract

Although multi-step generative policies achieve strong performance in robotic manipulation by modeling multimodal action distributions, they require multi-step iterative denoising at inference time. Each action therefore needs tens to hundreds of network function evaluations (NFEs), making them costly for high-frequency closed-loop control and online reinforcement learning (RL). To address this limitation, we propose a two-stage framework for native one-step generative policies that shifts refinement from inference to training. First, we introduce the Drift-Based Policy (DBP), which leverages fixed-point drifting objectives to internalize iterative refinement into the model parameters, yielding a one-step generative backbone by design while preserving multimodal action modeling capacity. Second, we develop Drift-Based Policy Optimization (DBPO), an online RL framework that equips the pretrained backbone with a compatible stochastic interface, enabling stable on-policy updates without sacrificing the one-step deployment property. Extensive experiments demonstrate the effectiveness of the proposed framework across offline imitation learning, online fine-tuning, and real-world control scenarios. DBP matches or exceeds the performance of multi-step diffusion policies while achieving up to 100× faster inference. It also consistently outperforms existing one-step baselines on challenging manipulation benchmarks. Moreover, DBPO enables effective and stable policy improvement in online settings. Experiments on a real-world dual-arm robot demonstrate reliable high-frequency control at 105.2 Hz.
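
As a rough illustration of how a pretrained one-step backbone can be exposed to on-policy RL, the sketch below wraps its output in a learned Gaussian so that log-probabilities, and therefore a clipped policy-gradient surrogate, can be computed, while deployment can still take the mean in a single forward pass. The Gaussian head, the PPO-style loss, and all names and hyperparameters here are assumptions for this summary, not the paper's actual DBPO interface or objective.

```python
# Minimal sketch (PyTorch) of a "stochastic interface" over a pretrained one-step
# backbone: the backbone output is treated as the mean of a Gaussian so that
# log-probabilities, and hence an on-policy clipped surrogate loss, can be computed.
# Shapes, the Gaussian head, and the loss are illustrative assumptions only.
import torch
import torch.nn as nn
from torch.distributions import Normal

class OneStepBackbone(nn.Module):
    """Stand-in for a pretrained one-step generator mapping (obs, noise) -> action."""
    def __init__(self, obs_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 256),
                                 nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, obs, noise):
        return self.net(torch.cat([obs, noise], dim=-1))

class StochasticInterface(nn.Module):
    """Wraps the one-step backbone with a learned Gaussian for on-policy updates."""
    def __init__(self, backbone, act_dim=7):
        super().__init__()
        self.backbone = backbone
        self.log_std = nn.Parameter(torch.full((act_dim,), -1.0))  # exploration scale

    def dist(self, obs, noise):
        mean = self.backbone(obs, noise)          # still a single NFE
        return Normal(mean, self.log_std.exp())

    @torch.no_grad()
    def act(self, obs, noise, deterministic=False):
        d = self.dist(obs, noise)
        return d.mean if deterministic else d.sample()

def clipped_surrogate(interface, obs, noise, actions, old_log_probs, advantages, clip=0.2):
    """PPO-style clipped objective over actions sampled from the stochastic interface."""
    log_probs = interface.dist(obs, noise).log_prob(actions).sum(-1)
    ratio = (log_probs - old_log_probs).exp()
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Deployment can remain deterministic and one-step, e.g.:
# action = StochasticInterface(OneStepBackbone()).act(obs, noise, deterministic=True)
```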