Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
arXiv cs.RO / 5/4/2026
Key Points
- The paper introduces Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework to continually improve generalist robot policies beyond what offline demonstrations can achieve.
- LWD closes the loop between deployment and learning by using autonomous robot rollouts plus human interventions across a robot fleet, then redeploying improved Vision-Language-Action (VLA) policies.
- To handle heterogeneous, sparse-reward data from real-world deployments, the method uses Distributional Implicit Value Learning (DIVL) for stable value estimation and Q-learning via Adjoint Matching (QAM) for extracting policies in flow-based VLA action generators.
- Experiments on a fleet of 16 dual-arm robots performing eight real-world manipulation tasks show that a single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with especially large gains on long-horizon tasks.
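
The deploy, collect, update, redeploy loop described in the key points can be pictured with a toy sketch. It illustrates only the data flow, not the paper's learning algorithm. Every name here (`Episode`, `collect_round`, `update_policy`, `learn_while_deploying`) is hypothetical, the success probabilities are invented for illustration, and the update step is a placeholder standing in for the DIVL and QAM components.

```python
import random
from dataclasses import dataclass


@dataclass
class Episode:
    robot_id: int
    task: str
    reward: float      # sparse reward: 1.0 on autonomous success, else 0.0
    intervened: bool   # True if a human operator had to take over


def collect_round(policy_version, num_robots, tasks, rng):
    """One deployment round: every robot attempts every task once."""
    episodes = []
    for robot_id in range(num_robots):
        for task in tasks:
            # Toy stand-in for a real rollout (assumption, not from the paper):
            # later policy versions succeed more often, capped at 0.95.
            p_success = min(0.5 + 0.1 * policy_version, 0.95)
            success = rng.random() < p_success
            # Assumption: a human intervenes whenever the autonomous attempt fails.
            episodes.append(
                Episode(robot_id, task, 1.0 if success else 0.0, not success)
            )
    return episodes


def update_policy(policy_version, buffer):
    """Placeholder for the offline-to-online RL update; only bumps a version."""
    return policy_version + 1 if buffer else policy_version


def learn_while_deploying(rounds=3, num_robots=16, num_tasks=8, seed=0):
    rng = random.Random(seed)
    tasks = [f"task_{i}" for i in range(num_tasks)]
    buffer = []          # shared fleet-wide experience buffer
    policy_version = 0   # hypothetical handle for the currently deployed policy
    for _ in range(rounds):
        # Deploy the current policy across the fleet and pool the experience.
        buffer.extend(collect_round(policy_version, num_robots, tasks, rng))
        # Train on everything gathered so far, then redeploy the new version.
        policy_version = update_policy(policy_version, buffer)
    return policy_version, buffer
```

Calling `learn_while_deploying()` returns the final policy version together with the pooled buffer. In a real system the buffer would hold full observation and action trajectories rather than per-episode summaries.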