EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
arXiv cs.LG / 4/22/2026
📰 News · Models & Research
Key Points
- The paper addresses a key RL design choice in LLM post-training: whether to rely on a learned critic as a baseline for policy optimization, which can affect variance behavior in sparse-reward regimes.
- It argues that in sparse rewards, a learned critic may add estimation noise that outweighs the state signal, thereby increasing (not reducing) advantage variance, and it provides a unified Kalman-filtering view of PPO vs. critic-free GRPO.
- By framing baseline selection via explained variance (EV), the authors derive a batch-computable criterion: positive EV means the critic reduces variance, while zero/negative EV indicates the critic inflates variance.
- They propose Explained Variance Policy Optimization (EVPO), which adaptively switches between critic-based and batch-mean advantage estimation at each step based on EV, guaranteeing no worse variance than the better option at that step.
- Experiments across task types spanning classical control, agentic interaction, and mathematical reasoning show EVPO consistently outperforming both PPO and GRPO, with additional evidence that EV-based gating tracks critic maturation and that the zero-EV threshold is empirically optimal.
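The EV-based gating rule described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function names are my own, and the paper's actual advantage estimator (e.g., whether it uses GAE or group-normalized returns) may differ. The sketch shows the core idea, which is computing explained variance over a batch and falling back to a batch-mean baseline whenever EV is non-positive.

```python
import numpy as np

def explained_variance(values, returns):
    """EV = 1 - Var(R - V) / Var(R).
    1.0 means the critic perfectly predicts returns; <= 0 means
    subtracting the critic inflates (or fails to reduce) variance."""
    var_r = np.var(returns)
    if var_r == 0.0:
        return 0.0  # degenerate batch: no return variance to explain
    return 1.0 - np.var(returns - values) / var_r

def evpo_advantages(values, returns):
    """Adaptive baseline: critic if EV > 0, else batch-mean (GRPO-style)."""
    if explained_variance(values, returns) > 0.0:
        return returns - values          # critic-based advantages (PPO-style)
    return returns - returns.mean()      # critic-free, batch-mean baseline
```

Because the gate is evaluated per batch at each optimization step, the resulting advantage variance is never worse than the better of the two baselines at that step, which is the guarantee the paper claims.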