Calibration-Aware Policy Optimization for Reasoning LLMs
arXiv cs.LG / 4/15/2026
Key Points
- The paper analyzes why GRPO-style optimization can worsen the relative calibration of reasoning LLMs, tracing the problem to uncertainty-agnostic advantage estimation that misaligns the optimization gradients with calibration objectives (a sketch of this advantage computation follows the list).
- It introduces Calibration-Aware Policy Optimization (CAPO), which enables uncertainty-aware advantage estimation via a logistic AUC surrogate loss backed by theoretically grounded consistency and regret bounds (see the surrogate-loss sketch below).
- CAPO additionally applies a noise-masking mechanism that stabilizes training, letting calibration and reasoning accuracy improve jointly.
- Experiments on mathematical reasoning benchmarks report calibration gains of up to 15% for CAPO-1.5B at accuracy comparable to or better than GRPO, plus up to 5% improvements on inference-time scaling tasks.
- When the model is allowed to abstain on low-confidence outputs, CAPO achieves a Pareto-optimal precision–coverage trade-off, pointing to potential for hallucination mitigation (the selective-prediction sketch below shows how these two metrics are computed).
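For context on the first point: in standard GRPO, the advantage of each rollout is simply its reward z-scored within the group sampled for the same prompt. The minimal sketch below (function and variable names are ours, not the paper's) makes the issue visible: the model's own confidence never enters the computation, which is the sense in which it is uncertainty-agnostic.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standard GRPO advantage: z-score rewards within each group of rollouts.

    rewards: (num_groups, group_size) scalar rewards, one row per prompt.
    Note that the model's confidence in its answers never appears here;
    only the rewards do, which is why this estimator is uncertainty-agnostic.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: binary correctness rewards for 4 rollouts on one prompt.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))
```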
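The logistic AUC surrogate itself is a standard pairwise relaxation of the ranking (AUC) objective; how CAPO folds it into advantage estimation is the paper's contribution and is not reproduced here. A generic sketch of the surrogate, with hypothetical tensor names:

```python
import torch
import torch.nn.functional as F

def logistic_auc_loss(conf: torch.Tensor, correct: torch.Tensor) -> torch.Tensor:
    """Pairwise logistic surrogate for AUC.

    conf:    (N,) model confidence scores.
    correct: (N,) boolean, whether each answer was correct.
    Penalizes every (correct, incorrect) pair in which the incorrect answer
    is not scored below the correct one; softplus(-(s_pos - s_neg)) is the
    smooth, consistent relaxation of the 0-1 ranking loss.
    """
    pos = conf[correct]                  # confidences on correct answers
    neg = conf[~correct]                 # confidences on incorrect answers
    diffs = pos[:, None] - neg[None, :]  # all pairwise margins
    return F.softplus(-diffs).mean()

conf = torch.tensor([0.9, 0.4, 0.7, 0.2], requires_grad=True)
correct = torch.tensor([True, False, True, False])
print(logistic_auc_loss(conf, correct))
```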
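Finally, the precision–coverage trade-off comes from selective prediction: the model answers only when its confidence clears a threshold, and sweeping that threshold traces the curve on which the paper claims Pareto optimality. A toy illustration on synthetic, hypothetical data:

```python
import numpy as np

def precision_coverage(conf: np.ndarray, correct: np.ndarray, tau: float):
    """Selective prediction: answer only when confidence >= tau.

    coverage  = fraction of questions the model answers,
    precision = accuracy on the answered subset.
    """
    answered = conf >= tau
    coverage = answered.mean()
    precision = correct[answered].mean() if answered.any() else float("nan")
    return precision, coverage

rng = np.random.default_rng(0)
conf = rng.uniform(size=1000)
correct = rng.uniform(size=1000) < conf  # toy: a well-calibrated model
for tau in (0.0, 0.5, 0.8):
    p, c = precision_coverage(conf, correct, tau)
    print(f"tau={tau:.1f}  precision={p:.2f}  coverage={c:.2f}")
```

Raising the threshold trades coverage for precision; a method dominates on this curve when, at every coverage level, it answers at least as precisely as the baseline.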