AI Navigate

Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning

arXiv cs.LG / 3/12/2026


Key Points

  • The paper identifies length inflation as a key challenge in reinforcement learning for LLMs, where models generate verbose or inefficient reasoning to maximize rewards.
  • It introduces Group Relative Reward Rescaling (GR^3), a multiplicative approach to length control that avoids issues associated with additive penalties and heuristic gating.
  • GR^3 uses group-relative regularization and advantage-aware calibration to adapt length budgets based on instance difficulty while preserving the value of high-quality trajectories.
  • Empirically, GR^3 maintains training dynamics and downstream performance similar to standard GRPO in RLHF and RLVR settings while significantly reducing length inflation and outperforming state-of-the-art length-regularized baselines.
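For context, standard GRPO scores each sampled response against the rest of its group rather than against a learned value function. A minimal sketch of that group-relative advantage computation (variable names are illustrative, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage as used in GRPO: normalize each
    trajectory's reward by its group's mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

GR^3 operates on top of this signal: it rescales rewards before the normalization step, so length control enters through the rewards themselves rather than through a separate penalty term.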

Abstract

Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR^3), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR^3 maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.
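The abstract's core idea, multiplicative rescaling with a group-relative length budget, can be sketched as follows. Note that the budget rule and the exponential gate below are illustrative assumptions for exposition; the paper's exact formulas are not given in this summary:

```python
import numpy as np

def gr3_rescale(rewards, lengths, alpha=0.5):
    """Hypothetical sketch of group-relative multiplicative reward
    rescaling. Overlong responses have their reward shrunk by a
    continuous, reward-preserving multiplicative gate instead of an
    additive penalty."""
    r = np.asarray(rewards, dtype=float)
    n = np.asarray(lengths, dtype=float)

    # Group-relative length budget: here, the mean length of the
    # group's above-average-reward trajectories (an assumed proxy
    # for adapting the budget to instance difficulty).
    good = r >= r.mean()
    budget = n[good].mean() if good.any() else n.mean()

    # Continuous multiplicative gate: responses within budget keep
    # their full reward; excess length shrinks it smoothly, so the
    # rescaled reward stays proportional to quality and cannot be
    # "bought back" the way an additive penalty can.
    excess = np.maximum(n / budget - 1.0, 0.0)
    return r * np.exp(-alpha * excess)
```

Because the gate multiplies rather than subtracts, a high-quality but overlong trajectory is attenuated, not inverted, which is one way to read the abstract's claim that the advantage signal of strong trajectories is preserved.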