Self-Distilled RLVR
arXiv cs.LG / 4/6/2026
Key Points
- The paper contrasts on-policy distillation (OPD) for LLMs with RLVR, noting that distillation provides dense trajectory-level signals while RLVR offers sparse, verifiable reward signals.
- It argues that prior on-policy self-distillation (OPSD), in which the same model serves as both teacher (with access to privileged information) and student, can cause severe information leakage and unstable long-term training when the learning signal is derived only from the privileged teacher.
- To address these issues, the authors propose RLSD (RLVR with Self-Distillation), which uses self-distillation primarily to produce token-level policy difference signals that set fine-grained update magnitudes.
- RLSD still relies on RLVR-style environmental feedback (e.g., response correctness) to determine reliable update directions, combining stability from RLVR with richer learning signals from self-distillation.
- The paper reports that this hybrid approach raises the convergence ceiling and improves training stability compared with approaches that lean too heavily on privileged teacher signals.
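The division of labor described above, verifiable reward for update direction, self-distillation gap for per-token update magnitude, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual formulation: the function name, the +1/-1 reward convention, and the specific rule of scaling by the absolute log-probability gap are all assumptions for exposition.

```python
def rlsd_token_updates(student_logps, teacher_logps, reward):
    """Illustrative sketch of the RLSD idea from the summary:
    - the RLVR-style verifiable reward (here +1 correct / -1 incorrect)
      fixes the update *direction* for the whole response;
    - the token-level gap between the privileged teacher's and the
      student's log-probabilities sets the update *magnitude*,
      giving a fine-grained, per-token signal.
    """
    direction = 1.0 if reward > 0 else -1.0
    # Larger teacher-student disagreement on a token -> larger update weight.
    return [direction * abs(t - s)
            for s, t in zip(student_logps, teacher_logps)]


# Hypothetical per-token log-probabilities for a two-token response.
student = [-1.0, -2.0]
teacher = [-0.5, -1.0]
print(rlsd_token_updates(student, teacher, reward=+1))  # direction up
print(rlsd_token_updates(student, teacher, reward=-1))  # direction down
```

The point of the sketch is the separation of concerns: the sparse environmental reward alone decides the sign, so unreliable teacher signals cannot flip the update direction, while the dense self-distillation gap only modulates how strongly each token is updated.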
