Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

arXiv cs.LG / 5/4/2026


Key Points

  • The paper explains why RLVR methods like GRPO can lose multi-sample coverage (Pass@K) even when Pass@1 improves, attributing the issue to the objective being indifferent to how probability mass is distributed among correct answers.
  • It formalizes a “diversity collapse” mechanism where stochastic training dynamics reinforce concentration of probability on a small subset of valid solutions, suppressing other correct outputs.
  • Using robustness and entropy-regularized optimality criteria, the authors characterize a uniquely optimal solution called the Uniform-Correct Policy, which allocates probability uniformly across all correct solutions (written out after this list).
  • Based on this analysis, they introduce Uniform-Correct Policy Optimization (UCPO), which modifies GRPO by adding a conditional uniformity penalty to rebalance gradients toward underrepresented correct responses.
  • Experiments on three model sizes (1.5B–7B) across five mathematical reasoning benchmarks show UCPO improves Pass@K and diversity with comparable Pass@1, including up to +10% absolute on AIME24 at Pass@64 and up to 45% higher equation-level diversity, with code released on GitHub.
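
For concreteness, the Uniform-Correct Policy named in the third point can be written out directly. The notation below is illustrative (the paper's own symbols may differ), with C(x) standing for the set of verifiably correct solutions to prompt x:

```latex
% Uniform-Correct Policy: spread all probability mass evenly over
% the correct set C(x), and assign none to incorrect outputs.
\pi^{\star}(y \mid x) =
\begin{cases}
  \dfrac{1}{|\mathcal{C}(x)|} & \text{if } y \in \mathcal{C}(x), \\[4pt]
  0 & \text{otherwise.}
\end{cases}
```

Among policies supported only on correct outputs, this is the maximum-entropy choice, which is why the paper's entropy-regularized criterion singles it out.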

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy-regularized optimality, which identify the Uniform-Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty on the policy's distribution over correct solutions. The penalty redistributes gradient signal toward underrepresented correct responses, encouraging uniform allocation of probability mass within the correct set. Across three models (1.5B–7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while maintaining competitive Pass@1, achieving up to +10% absolute improvement on AIME24 at Pass@64 and up to 45% higher equation-level diversity within the correct set. The code is available at https://github.com/AnamikaLochab/UCPO.
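
The abstract describes the penalty only at a high level, so the sketch below is one plausible reading rather than the paper's implementation. It assumes the "conditional uniformity penalty" is a KL divergence between the policy's renormalized distribution over the sampled correct responses and the uniform distribution, added to a simplified (unclipped) GRPO-style policy-gradient loss. All names here (`ucpo_style_loss`, `lambda_u`, and so on) are hypothetical, not the released code's API.

```python
# Minimal sketch of a GRPO-style update with a conditional uniformity
# penalty, under the assumptions stated above. Ratio clipping and the
# reference-policy KL of full GRPO are omitted for brevity.
import torch

def ucpo_style_loss(log_probs, rewards, lambda_u=0.1):
    """log_probs: (G,) sequence log-probs under the current policy for
    G sampled responses; rewards: (G,) float verifiable rewards in {0, 1}."""
    # GRPO-style group-normalized advantages.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    loss = -(adv.detach() * log_probs).mean()

    correct = rewards > 0
    if correct.sum() > 1:
        # Renormalize the policy's mass over the sampled correct set.
        p_correct = torch.softmax(log_probs[correct], dim=0)
        # KL(p_correct || uniform) = sum p * log(p * k); it is zero iff
        # mass is spread evenly over the k correct samples, so its
        # gradient pushes probability toward underrepresented correct
        # responses.
        k = p_correct.numel()
        kl = (p_correct * (p_correct * k).log()).sum()
        loss = loss + lambda_u * kl
    return loss

# Example: 4 sampled responses, 3 correct, mass concentrated on one.
log_probs = torch.tensor([-1.0, -5.0, -6.0, -2.0], requires_grad=True)
rewards = torch.tensor([1.0, 1.0, 1.0, 0.0])
ucpo_style_loss(log_probs, rewards).backward()
```

Because KL(p || uniform) shrinks as mass spreads out, the penalty's gradient is largest on correct responses that currently carry little probability, matching the redistribution behavior the abstract describes.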