AI Navigate

Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning

arXiv cs.LG / 3/12/2026


Key Points

  • The paper identifies length inflation as a key challenge in reinforcement learning for LLMs, where models generate verbose or inefficient reasoning to maximize rewards.
  • It introduces Group Relative Reward Rescaling (GR^3), a multiplicative approach to length control that avoids issues associated with additive penalties and heuristic gating.
  • GR^3 uses group-relative regularization and advantage-aware calibration to adapt length budgets based on instance difficulty while preserving the value of high-quality trajectories.
  • Empirically, GR^3 maintains training dynamics and downstream performance similar to standard GRPO in RLHF and RLVR settings while significantly reducing length inflation and outperforming state-of-the-art length-regularized baselines.
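For context, standard GRPO scores each sampled response against the rest of its group rather than against a learned value function. A minimal sketch of that group-relative advantage computation (variable names are illustrative, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage as used in GRPO: normalize each
    trajectory's reward by its group's mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

GR^3 operates on top of this signal: it rescales rewards before the normalization step, so length control enters through the rewards themselves rather than through a separate penalty term.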

Abstract

Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR^3), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR^3 maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.
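The abstract's core idea, multiplicative rescaling with a group-relative length budget, can be sketched as follows. Note that the budget rule and the exponential gate below are illustrative assumptions for exposition; the paper's exact formulas are not given in this summary:

```python
import numpy as np

def gr3_rescale(rewards, lengths, alpha=0.5):
    """Hypothetical sketch of group-relative multiplicative reward
    rescaling. Overlong responses have their reward shrunk by a
    continuous, reward-preserving multiplicative gate instead of an
    additive penalty."""
    r = np.asarray(rewards, dtype=float)
    n = np.asarray(lengths, dtype=float)

    # Group-relative length budget: here, the mean length of the
    # group's above-average-reward trajectories (an assumed proxy
    # for adapting the budget to instance difficulty).
    good = r >= r.mean()
    budget = n[good].mean() if good.any() else n.mean()

    # Continuous multiplicative gate: responses within budget keep
    # their full reward; excess length shrinks it smoothly, so the
    # rescaled reward stays proportional to quality and cannot be
    # "bought back" the way an additive penalty can.
    excess = np.maximum(n / budget - 1.0, 0.0)
    return r * np.exp(-alpha * excess)
```

Because the gate multiplies rather than subtracts, a high-quality but overlong trajectory is attenuated, not inverted, which is one way to read the abstract's claim that the advantage signal of strong trajectories is preserved.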