Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
arXiv cs.LG / 4/3/2026
Key Points
- The paper analyzes why existing RLVR post-training methods such as GRPO and SDPO fall short: GRPO's coarse credit assignment penalizes every token of a failed rollout uniformly, while SDPO's self-distillation can become unstable over long training runs because its targets are ambiguous on already-correct samples and the teacher's reliability degrades.
- It proposes Sample-Routed Policy Optimization (SRPO), an on-policy framework that routes correct samples to GRPO-style reward-aligned reinforcement and failed samples to SDPO-style token/logit-level correction (see the sketch after this list).
- SRPO adds an entropy-aware dynamic weighting strategy to downweight high-entropy (less reliable) distillation targets and prioritize confident signals.
- Across five benchmarks and two model sizes, SRPO is reported to combine SDPO’s fast early gains with GRPO’s long-horizon stability, outperforming both baselines at their peak performance.
- The authors report a 3.4% lift on a five-benchmark average for Qwen3-8B versus GRPO and a 6.3% improvement versus SDPO, alongside moderate response lengths and up to 17.2% lower per-step compute cost.
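
Below is a minimal sketch of how the routing and entropy-aware weighting described above could look for a single group of rollouts, assuming PyTorch. The function name `srpo_loss`, the tensor shapes, and the exponential entropy weight `exp(-beta * H)` are illustrative assumptions, not the authors' implementation: correct rollouts feed a GRPO-style group-relative policy-gradient term, failed rollouts feed an SDPO-style token-level KL toward a self-distillation teacher, and high-entropy teacher positions are downweighted.

```python
# Illustrative sketch of SRPO-style sample routing (not the authors' code).
import torch
import torch.nn.functional as F


def srpo_loss(policy_logits, teacher_logits, actions, rewards, mask, beta=1.0):
    """Route correct rollouts to a group-relative policy-gradient term and
    failed rollouts to an entropy-weighted token-level distillation term.

    policy_logits:  [G, T, V] logits of the current policy for G rollouts
    teacher_logits: [G, T, V] logits of the (frozen) self-distillation teacher
    actions:        [G, T]    sampled token ids
    rewards:        [G]       verifiable 0/1 outcome per rollout
    mask:           [G, T]    1 for generated tokens, 0 for padding
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    token_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # [G, T]

    # GRPO-style branch: group-relative advantages, applied to correct samples.
    correct = rewards > 0.5
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)              # [G]
    pg = -(adv.unsqueeze(-1) * token_logp * mask)                          # [G, T]
    pg_loss = (pg * correct.float().unsqueeze(-1)).sum() / mask.sum()

    # SDPO-style branch: token/logit-level KL toward the teacher on failed samples.
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    teacher_p = teacher_logp.exp()
    kl = (teacher_p * (teacher_logp - log_probs)).sum(-1)                  # [G, T]

    # Entropy-aware dynamic weighting: downweight high-entropy (less reliable)
    # teacher positions so distillation relies on confident signals.
    teacher_entropy = -(teacher_p * teacher_logp).sum(-1)                  # [G, T]
    w = torch.exp(-beta * teacher_entropy)
    distill_loss = (w * kl * mask * (~correct).float().unsqueeze(-1)).sum() / mask.sum()

    return pg_loss + distill_loss
```

In a full training loop, the advantage normalization would presumably be computed per prompt group and the teacher would be a periodically refreshed snapshot of the policy, consistent with the self-distillation setup the paper describes; the exact weighting schedule for the distillation targets is a detail left to the paper.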