VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation
arXiv cs.RO / 3/30/2026
Key Points
- The paper introduces VLA-OPD, a post-training framework for vision-language-action (VLA) robotic models that combines the efficiency of offline supervised fine-tuning (SFT) with the robustness of online reinforcement learning (RL).
- Instead of relying on sparse environmental rewards, VLA-OPD uses an expert teacher to provide dense, token-level supervision on the student's self-generated trajectories, enabling corrective learning on policy-induced states.
- The method uses a reverse-KL objective to stabilize learning, avoiding both the entropy issues of forward KL and the premature entropy collapse associated with hard cross-entropy (a minimal sketch follows this list).
- Experiments on LIBERO and RoboTwin2.0 show that VLA-OPD improves sample efficiency versus RL, increases robustness versus SFT, and mitigates catastrophic forgetting of pre-trained capabilities.
- Overall, the approach frames post-training as "gentle alignment": it preserves prior generalization while correcting errors under the distribution shift that the evolving policy induces.
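To make the mechanism concrete, here is a minimal PyTorch sketch of the two ingredients the key points describe: a token-level reverse-KL loss, and an on-policy step in which the student samples the trajectory and a frozen teacher scores it. The model interfaces (`student.generate`, calling the policies on observation/token pairs) are illustrative assumptions, not the paper's actual API.

```python
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits, teacher_logits):
    """Token-level KL(pi_student || pi_teacher); logits have shape
    [batch, seq, vocab] over discrete action tokens."""
    log_p = F.log_softmax(student_logits, dim=-1)           # student log-probs
    log_q = F.log_softmax(teacher_logits.detach(), dim=-1)  # teacher is frozen
    # Mode-seeking objective: the student is penalized for placing mass where
    # the teacher has little, without collapsing onto a single argmax label
    # the way hard cross-entropy would.
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()

# Hypothetical training step: the *student* generates the trajectory and the
# teacher only scores it, so dense supervision lands on policy-induced states
# rather than on expert demonstration states.
def distillation_step(student, teacher, obs, optimizer):
    action_tokens = student.generate(obs)       # on-policy rollout (assumed API)
    s_logits = student(obs, action_tokens)      # student re-scores its own tokens
    with torch.no_grad():
        t_logits = teacher(obs, action_tokens)  # dense, per-token teacher signal
    loss = reverse_kl_loss(s_logits, t_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the rollout comes from the student, every token receives dense teacher supervision even on states the offline demonstrations never visit, which is what lets the method correct compounding errors without an environmental reward.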