Self-Distilled RLVR

arXiv cs.LG / 4/6/2026


Key Points

  • The paper contrasts on-policy distillation (OPD) for LLMs with RLVR, noting that distillation provides dense trajectory-level signals while RLVR offers sparse, verifiable reward signals.
  • It argues that prior on-policy self-distillation (OPSD), where the same model acts as both teacher and student and the teacher receives privileged information, can cause severe information leakage and unstable long-term training when learning signals come only from the privileged teacher.
  • To address these issues, the authors propose RLSD (RLVR with Self-Distillation), which uses self-distillation primarily to produce token-level policy difference signals that set fine-grained update magnitudes.
  • RLSD still relies on RLVR-style environmental feedback (e.g., response correctness) to determine reliable update directions, combining stability from RLVR with richer learning signals from self-distillation.
  • The paper reports that this hybrid approach achieves a higher convergence ceiling and better training stability than approaches that lean too heavily on privileged teacher signals.

Abstract

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals derived solely from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose **RLSD** (**RL**VR with **S**elf-**D**istillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.
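The core split described above (direction from the verifiable reward, magnitude from token-level policy differences) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `rlsd_token_weights` and the use of absolute log-probability gaps as the per-token magnitude are assumptions for exposition.

```python
def rlsd_token_weights(student_logprobs, teacher_logprobs, reward):
    """Sketch of an RLSD-style per-token update weight.

    - Direction comes from RLVR-style environmental feedback:
      the sign of the verifiable reward (+1 for a correct response,
      -1 otherwise).
    - Magnitude comes from self-distillation: here, the absolute
      per-token log-probability gap between the privileged-teacher
      pass and the plain student pass (hypothetical choice of
      magnitude; the paper's exact formulation may differ).
    """
    direction = 1.0 if reward > 0 else -1.0
    return [direction * abs(t - s)
            for s, t in zip(student_logprobs, teacher_logprobs)]

# Tokens where teacher and student agree get ~zero weight;
# tokens where the privileged pass diverges get larger updates,
# pushed in the direction the verifiable reward dictates.
weights = rlsd_token_weights(
    student_logprobs=[-1.0, -2.0, -0.3],
    teacher_logprobs=[-0.5, -2.0, -0.1],
    reward=1.0,  # response judged correct by the verifier
)
```

Compared with pure OPSD, the teacher's policy difference here never sets the sign of the update, only its size, which is how the hybrid avoids amplifying leaked privileged information.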