Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic
arXiv stat.ML / 2026-03-24
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key points
- The paper develops a unified theoretical framework showing that the Group Relative Policy Optimization (GRPO) policy gradient can be expressed as a U-statistic, clarifying why GRPO works in practice (see the derivation sketched after this list).
- It derives statistical properties for GRPO, including a mean-squared-error characterization, finite-sample error bounds, and the asymptotic distribution of the suboptimality gap for the learned policy.
- The authors prove that GRPO is asymptotically equivalent to an “oracle” policy-gradient method with access to a value function that measures policy quality at each training iteration, implying near-optimal long-run performance.
- A universal scaling law is established to guide the choice of group size, and experiments validate both the universality of the optimal group size and the oracle-like behavior.
- Overall, the work links GRPO’s strong empirical track record (notably in DeepSeekMath and DeepSeek-R1) to classical statistics, enabling more principled tuning and theoretical guarantees.
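
To make the U-statistic claim concrete, here is a minimal sketch in notation of our own (not the paper's), assuming the standard GRPO advantage with a group-mean baseline and omitting the reward normalization and PPO-style clipping used in practice. For a prompt $q$, sample $G$ responses $o_1,\dots,o_G$ with rewards $r_i$ and score vectors $g_i = \nabla_\theta \log \pi_\theta(o_i \mid q)$. The group-relative gradient estimate then rearranges into an average over unordered pairs:

$$
\hat{g} \;=\; \frac{1}{G}\sum_{i=1}^{G} g_i\,(r_i - \bar{r})
\;=\; \frac{1}{G^2}\sum_{i<j} (g_i - g_j)(r_i - r_j),
\qquad \bar{r} = \frac{1}{G}\sum_{j=1}^{G} r_j .
$$

Up to the constant factor $(G-1)/(2G)$, this is a U-statistic of order two with symmetric kernel $h(o_i, o_j) = (g_i - g_j)(r_i - r_j)$, which is what lets the classical U-statistic toolbox (variance decompositions, finite-sample bounds, central limit theorems) be brought to bear on GRPO; the paper's exact formulation may differ in details.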

