Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

arXiv cs.LG / 4/3/2026


Key Points

  • The paper analyzes why existing RLVR post-training methods like GRPO and SDPO underperform: GRPO’s coarse credit assignment penalizes whole failed rollouts uniformly, while SDPO’s self-distillation can become unstable during long training due to ambiguity on already-correct samples and degrading teacher reliability.
  • It proposes Sample-Routed Policy Optimization (SRPO), an on-policy framework that selectively routes correct samples to GRPO-style reward-aligned reinforcement and failed samples to SDPO-style token/logit-level correction.
  • SRPO adds an entropy-aware dynamic weighting strategy to downweight high-entropy (less reliable) distillation targets and prioritize confident signals.
  • Across five benchmarks and two model sizes, SRPO is reported to combine SDPO’s fast early gains with GRPO’s long-horizon stability, surpassing the peak performance of both baselines.
  • The authors report a 3.4% lift on a five-benchmark average for Qwen3-8B versus GRPO and a 6.3% improvement versus SDPO, alongside moderate response lengths and up to 17.2% lower per-step compute cost.
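The routing rule in the key points above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's released code: the function name `srpo_route` and the binary-reward assumption are mine. It shows the two pieces the summary describes, a group-relative (GRPO-style) advantage computed within each rollout group, and a mask that sends correct rollouts to reward-aligned reinforcement and failed ones to logit-level self-distillation.

```python
import numpy as np

def srpo_route(rewards):
    """Sketch of SRPO's sample routing (hypothetical helper, not from the paper).

    rewards: binary verifiable rewards, one per rollout in a GRPO group.
    Returns the group-relative advantage plus boolean routing masks.
    """
    rewards = np.asarray(rewards, dtype=float)
    # GRPO-style credit assignment: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    correct = rewards > 0.5   # routed to GRPO's reward-aligned reinforcement
    failed = ~correct         # routed to SDPO's token/logit-level correction
    return adv, correct, failed

adv, correct, failed = srpo_route([1, 1, 0, 0])
```

Under this routing, the ambiguity the paper identifies disappears by construction: already-correct samples never receive a self-distillation target, and failed rollouts are no longer penalized uniformly but corrected at the token level.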

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
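The entropy-aware dynamic weighting mentioned in the abstract can be sketched as follows. The exact weighting function is not given in this summary, so the exponential form, the `temperature` parameter, and the function name `entropy_weights` are assumptions; the sketch only illustrates the stated behavior, that high-entropy (unreliable) teacher distributions are downweighted while confident ones are emphasized.

```python
import numpy as np

def entropy_weights(teacher_probs, temperature=1.0):
    """Hypothetical entropy-aware weighting for distillation targets.

    teacher_probs: (T, V) array of per-token probability distributions
    from the self-teacher. Returns one weight in (0, 1] per token.
    """
    p = np.asarray(teacher_probs, dtype=float)
    ent = -np.sum(p * np.log(p + 1e-12), axis=-1)  # per-token entropy
    max_ent = np.log(p.shape[-1])                   # entropy of a uniform dist
    # Confident (low-entropy) tokens get weight near 1; maximally
    # uncertain tokens are suppressed toward exp(-temperature).
    return np.exp(-temperature * ent / max_ent)

# A one-hot (confident) teacher token vs. a uniform (unreliable) one:
w = entropy_weights([[1.0, 0.0, 0.0, 0.0],
                     [0.25, 0.25, 0.25, 0.25]])
```

Any monotone decreasing function of entropy would serve the same purpose; the exponential form is just one simple choice that keeps weights bounded and differentiable.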
