Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
arXiv cs.LG / 5/4/2026
Key Points
- The paper explains why RLVR (reinforcement learning with verifiable rewards) methods such as GRPO can lose multi-sample coverage (Pass@K) even as Pass@1 improves: the objective is indifferent to how probability mass is distributed among correct answers.
- It formalizes a “diversity collapse” mechanism where stochastic training dynamics reinforce concentration of probability on a small subset of valid solutions, suppressing other correct outputs.
- Using robustness and entropy-regularized optimality criteria, the authors characterize a uniquely optimal solution, the Uniform-Correct Policy, which allocates probability uniformly across all correct solutions (sketched formally after this list).
- Based on this analysis, they introduce Uniform-Correct Policy Optimization (UCPO), which modifies GRPO with a conditional uniformity penalty that rebalances gradients toward underrepresented correct responses (see the code sketch below).
- Experiments on three model sizes (1.5B–7B) across five mathematical reasoning benchmarks show UCPO improves Pass@K and diversity at comparable Pass@1, including up to +10% absolute Pass@64 on AIME24 and up to 45% higher equation-level diversity; code is released on GitHub.
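
As a rough formal picture of why the uniform policy is singled out (notation ours, not necessarily the paper's): with a verifiable 0/1 reward $r(x,y)=\mathbb{1}[y \in C(x)]$ over the set $C(x)$ of correct responses to prompt $x$, the entropy-regularized objective

$$
\pi^{*}(\cdot \mid x) \;=\; \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x,y)\big] \;+\; \tau\,\mathcal{H}\big(\pi(\cdot \mid x)\big)
$$

has the softmax solution $\pi^{*}(y \mid x) \propto \exp(r(x,y)/\tau)$, which puts equal mass on every correct $y$; as $\tau \to 0$ it converges to the uniform-correct policy $\pi_{\mathrm{UC}}(y \mid x) = \mathbb{1}[y \in C(x)]/|C(x)|$.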
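
And a minimal sketch of how a conditional uniformity penalty could be folded into GRPO's group-relative advantages. The function name `ucpo_advantages`, the penalty form, and the hyperparameter `lam` are our assumptions for illustration, not the paper's exact formulation:

```python
import torch

def ucpo_advantages(rewards: torch.Tensor, logprobs: torch.Tensor,
                    lam: float = 0.1) -> torch.Tensor:
    """Group-relative advantages with a hypothetical uniformity penalty.

    rewards:  (G,) 0/1 verifier rewards for G samples from one prompt.
    logprobs: (G,) sequence log-probs of those samples under the policy.
    lam:      penalty strength (illustrative, not from the paper).
    """
    # Standard GRPO baseline: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    correct = rewards > 0.5
    if correct.sum() > 1:
        # Conditional uniformity term (our sketch): among correct samples
        # only, damp the advantage of responses the policy already
        # over-weights and boost under-weighted ones, nudging the
        # conditional distribution over correct answers toward uniform.
        lp = logprobs[correct]
        adv[correct] = adv[correct] - lam * (lp - lp.mean())
    return adv
```

The conditioning is the key design point: the penalty touches only correct samples, so it cannot reward wrong answers; it merely redistributes gradient signal within the correct set.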

