Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

arXiv cs.LG / 5/4/2026


Key Points

  • The paper explains why RLVR methods like GRPO can lose multi-sample coverage (Pass@K) even when Pass@1 improves, attributing the issue to the objective being indifferent to how probability mass is distributed among correct answers.
  • It formalizes a “diversity collapse” mechanism where stochastic training dynamics reinforce concentration of probability on a small subset of valid solutions, suppressing other correct outputs.
  • Using robustness and entropy-regularized optimality criteria, the authors characterize a uniquely optimal solution called the Uniform-Correct Policy, which allocates probability uniformly across all correct solutions (written out after this list).
  • Based on this analysis, they introduce Uniform-Correct Policy Optimization (UCPO), which modifies GRPO by adding a conditional uniformity penalty to rebalance gradients toward underrepresented correct responses.
  • Experiments on three model sizes (1.5B–7B) across five mathematical reasoning benchmarks show UCPO improves Pass@K and diversity with comparable Pass@1, including up to +10% absolute on AIME24 at Pass@64 and up to 45% higher equation-level diversity, with code released on GitHub.
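
For concreteness, the Uniform-Correct Policy named in the third point can be written out directly. The notation below is illustrative (the paper's own symbols may differ), with C(x) standing for the set of verifiably correct solutions to prompt x:

```latex
% Uniform-Correct Policy: spread all probability mass evenly over
% the correct set C(x), and assign none to incorrect outputs.
\pi^{\star}(y \mid x) =
\begin{cases}
  \dfrac{1}{|\mathcal{C}(x)|} & \text{if } y \in \mathcal{C}(x), \\[4pt]
  0 & \text{otherwise.}
\end{cases}
```

Among policies supported only on correct outputs, this is the maximum-entropy choice, which is why the paper's entropy-regularized criterion singles it out.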

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy-regularized optimality, which identify the Uniform-Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty on the policy's distribution over correct solutions. The penalty redistributes gradient signal toward underrepresented correct responses, encouraging uniform allocation of probability mass within the correct set. Across three models (1.5B–7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while maintaining competitive Pass@1, achieving up to +10% absolute improvement on AIME24 at Pass@64 and up to 45% higher equation-level diversity within the correct set. The code is available at https://github.com/AnamikaLochab/UCPO.
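
The abstract describes the penalty only at a high level, so the sketch below is one plausible reading rather than the paper's implementation. It assumes the "conditional uniformity penalty" is a KL divergence between the policy's renormalized distribution over the sampled correct responses and the uniform distribution, added to a simplified (unclipped) GRPO-style policy-gradient loss. All names here (`ucpo_style_loss`, `lambda_u`, and so on) are hypothetical, not the released code's API.

```python
# Minimal sketch of a GRPO-style update with a conditional uniformity
# penalty, under the assumptions stated above. Ratio clipping and the
# reference-policy KL of full GRPO are omitted for brevity.
import torch

def ucpo_style_loss(log_probs, rewards, lambda_u=0.1):
    """log_probs: (G,) sequence log-probs under the current policy for
    G sampled responses; rewards: (G,) float verifiable rewards in {0, 1}."""
    # GRPO-style group-normalized advantages.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    loss = -(adv.detach() * log_probs).mean()

    correct = rewards > 0
    if correct.sum() > 1:
        # Renormalize the policy's mass over the sampled correct set.
        p_correct = torch.softmax(log_probs[correct], dim=0)
        # KL(p_correct || uniform) = sum p * log(p * k); it is zero iff
        # mass is spread evenly over the k correct samples, so its
        # gradient pushes probability toward underrepresented correct
        # responses.
        k = p_correct.numel()
        kl = (p_correct * (p_correct * k).log()).sum()
        loss = loss + lambda_u * kl
    return loss

# Example: 4 sampled responses, 3 correct, mass concentrated on one.
log_probs = torch.tensor([-1.0, -5.0, -6.0, -2.0], requires_grad=True)
rewards = torch.tensor([1.0, 1.0, 1.0, 0.0])
ucpo_style_loss(log_probs, rewards).backward()
```

Because KL(p || uniform) shrinks as mass spreads out, the penalty's gradient is largest on correct responses that currently carry little probability, matching the redistribution behavior the abstract describes.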