When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning
arXiv cs.LG / 4/28/2026
📰 News · Models & Research
Key Points
- The paper studies how to adapt deployment-time objectives for offline RL when the trained actor is frozen and cannot be retrained, using Product-of-Experts (PoE) composition with a goal-conditioned prior.
- The authors find that PoE-style steering shows graceful degradation under degraded or random priors, whereas additive or prior-only adaptation can collapse in performance.
- They derive a closed-form equivalence: for diagonal-Gaussian policies, PoE with coefficient α matches the deterministic policy of KL-regularized adaptation with β = α/(1-α), differing mainly in posterior covariance scaling (see the numerical sketch after this list).
- Empirically, across multiple D4RL settings, including AntMaze, medium-expert frozen actors reach an “actor-competence ceiling,” and some configurations (e.g., behavior-cloned frozen actors on AntMaze) yield zero success under every composition rule.
- The work frames PoE and KL-regularized adaptation as essentially the same actor-anchored safety mechanism for deployment-time steering rather than a universal performance booster.
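
The equivalence in the third bullet is easy to check numerically. The sketch below is a minimal illustration, assuming the standard forms implied by the summary: PoE composition π(a) ∝ actor(a)^(1-α) · prior(a)^α, and a KL-regularized deterministic policy that maximizes log actor(a) + β · log prior(a). All function names, variable names, and numbers here are illustrative, not taken from the paper.

```python
# Minimal sketch: PoE of tempered diagonal Gaussians vs. the mode of a
# KL-regularized objective, under the assumed forms stated above.
import numpy as np

def poe_gaussian(mu_a, var_a, mu_p, var_p, alpha):
    """Gaussian proportional to actor(a)^(1-alpha) * prior(a)^alpha."""
    prec = (1 - alpha) / var_a + alpha / var_p            # posterior precision
    mean = ((1 - alpha) * mu_a / var_a + alpha * mu_p / var_p) / prec
    return mean, 1.0 / prec

def kl_reg_mode(mu_a, var_a, mu_p, var_p, beta):
    """Maximizer of log actor(a) + beta * log prior(a) (deterministic policy)."""
    prec = 1.0 / var_a + beta / var_p
    mean = (mu_a / var_a + beta * mu_p / var_p) / prec
    return mean, 1.0 / prec

# Illustrative 3-D action distribution parameters (made-up numbers).
mu_a, var_a = np.array([0.2, -0.5, 1.0]), np.array([0.04, 0.09, 0.01])
mu_p, var_p = np.array([0.0,  0.3, 0.8]), np.array([0.25, 0.25, 0.25])

alpha = 0.3
beta = alpha / (1 - alpha)                                # the claimed mapping

poe_mean, poe_var = poe_gaussian(mu_a, var_a, mu_p, var_p, alpha)
kl_mean, kl_var = kl_reg_mode(mu_a, var_a, mu_p, var_p, beta)

assert np.allclose(poe_mean, kl_mean)                     # same means/modes
assert np.allclose(poe_var, kl_var / (1 - alpha))         # covariances differ
                                                          # by a 1/(1-alpha) scale
```

Because the composed mean is a precision-weighted average anchored at the frozen actor, and recovers the actor exactly as α approaches 0, the composition degrades toward the actor's own behavior rather than collapsing, which matches the paper's framing of PoE as an actor-anchored safety mechanism.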