Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning

arXiv cs.RO / 4/3/2026


Key Points

  • The paper proposes POCO (Posterior Optimization with Clipped Objective), an RL framework that turns generative policy improvement into a posterior inference problem over temporally extended action chunks.
  • POCO uses an Expectation-Maximization-style procedure to distill a reward-weighted implicit posterior into the policy without requiring explicit likelihood estimation.
  • It introduces an offline-to-online training strategy that ties online exploration to pre-trained policy priors, aiming to improve stability and sample efficiency.
  • The method is model-agnostic, so it can fine-tune large VLA (vision-language-action) models without architectural changes.
  • Experiments on 7 simulation benchmarks and 4 real-world contact-rich robotic tasks report that POCO avoids catastrophic policy collapse, beats state-of-the-art baselines, and reaches a 96.7% success rate in real-world tests.
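The EM-style procedure in the second bullet can be illustrated with a minimal sketch. This is an assumption-laden toy, not POCO's actual algorithm: the function names (`policy_sample`, `policy_update`, `reward_fn`) and the softmax reweighting are hypothetical stand-ins for the paper's implicit posterior. The point it shows is that the policy can be pulled toward high-reward action chunks using only samples and rewards, with no explicit likelihood of the generative policy ever computed.

```python
import numpy as np

def em_distill_step(policy_sample, policy_update, obs, reward_fn,
                    n_samples=16, temperature=1.0):
    """One hypothetical EM-style distillation step (illustrative only).

    E-step: sample action chunks and reweight them by exponentiated reward,
    forming an implicit reward-weighted posterior over chunks.
    M-step: fit the policy to the reweighted samples (supervised distillation),
    so no explicit policy likelihood is required.
    """
    # E-step: draw temporally extended action chunks from the current policy
    chunks = [policy_sample(obs) for _ in range(n_samples)]  # each: (horizon, act_dim)
    rewards = np.array([reward_fn(obs, c) for c in chunks])
    # Softmax-weight chunks by reward (max-subtracted for numerical stability)
    w = np.exp((rewards - rewards.max()) / temperature)
    w /= w.sum()
    # M-step: weighted regression of the policy toward high-reward chunks
    policy_update(obs, chunks, w)
    return rewards.mean()
```

In this toy, `policy_update` would be any weighted supervised update of the generative policy (e.g. a weighted denoising or behavior-cloning loss); the weights, not likelihood ratios, carry the reward signal.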

Abstract

Expressive generative models have advanced robotic manipulation by capturing complex, multi-modal action distributions over temporally extended trajectories. However, fine-tuning these policies via RL remains challenging due to instability and sample inefficiency. We introduce Posterior Optimization with Clipped Objective (POCO), a principled RL framework that formulates policy improvement as a posterior inference problem tailored for temporal action chunks. Through an Expectation-Maximization procedure, POCO distills a reward-weighted implicit posterior into the policy without likelihood estimation. Furthermore, POCO adopts an offline-to-online paradigm that anchors online exploration to pre-trained priors, and its model-agnostic design scales to fine-tune large VLA models without architectural modifications. Evaluations across 7 simulation benchmarks and 4 contact-rich real-world tasks demonstrate that POCO prevents catastrophic policy collapse, outperforms SOTA baselines, and achieves a 96.7% success rate on real-world tasks. Videos are available at our project website https://cccedric.github.io/poco/.
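The "clipped objective" in the name plausibly serves to bound how much any single high-reward sample can dominate an update, analogous in spirit to trust-region clipping. The sketch below is my assumption about one simple way such clipping could look when applied to posterior sample weights (the `clip` parameter and the cap-at-a-multiple-of-uniform rule are invented for illustration, not taken from the paper):

```python
import numpy as np

def clipped_weights(rewards, temperature=1.0, clip=1.5):
    """Hypothetical clipped reweighting (illustrative, not POCO's objective).

    Rewards are converted to softmax weights, then each weight is capped at
    `clip` times the uniform weight 1/n before renormalizing, limiting how
    strongly one outlier chunk can steer the distillation step.
    """
    w = np.exp((rewards - rewards.max()) / temperature)
    w /= w.sum()
    w = np.minimum(w, clip / len(w))  # cap any weight at clip x the uniform weight
    return w / w.sum()
```

Note that renormalization can partially restore a capped weight when the others are tiny; a real objective would bound the update more carefully, e.g. at the loss level.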