Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision
arXiv cs.CL / April 15, 2026
Key Points
- The paper proposes Self-Distillation Zero (SD-Zero), which converts sparse binary rewards from verifiable tasks into dense token-level supervision without needing an external teacher or high-quality demonstrations.
- SD-Zero uses a single model in two roles: a Generator that produces an initial answer, and a Reviser that conditions on the Generator's response plus its binary reward to produce an improved response.
- It then performs on-policy self-distillation to transfer the Reviser’s token distributions back into the Generator, effectively training the model to localize and correct key tokens based on reward.
- Experiments on math and code reasoning benchmarks (using Qwen3-4B-Instruct and Olmo-3-7B-Instruct) show at least a 10% improvement over base models and better results than baselines like RFT, GRPO, and SDFT under the same training sample budget.
- Ablations highlight two distinctive behaviors: token-level self-localization of which response tokens to revise, and iterative self-evolution via regular teacher synchronization.
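The token-level self-localization described above can be illustrated with a toy sketch. This is not the paper's implementation; it only shows, under simple assumptions, how a per-token KL divergence between a Reviser (teacher) and a Generator (student) distribution yields a dense signal that flags which response tokens need revision. The function names and the logits are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_kl(reviser_logits, generator_logits):
    """Per-token KL(reviser || generator).

    High values flag tokens where the Reviser disagrees with the
    Generator, turning a single binary reward into a dense,
    token-level training signal (a hypothetical illustration,
    not SD-Zero's actual loss).
    """
    p = softmax(reviser_logits)    # teacher (Reviser) distribution
    q = softmax(generator_logits)  # student (Generator) distribution
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# Toy example: a 4-token response over a 5-token vocabulary.
rng = np.random.default_rng(0)
gen = rng.normal(size=(4, 5))
rev = gen.copy()
rev[2] += np.array([3.0, 0.0, 0.0, 0.0, 0.0])  # Reviser disagrees only on token 2

kl = token_kl(rev, gen)
print(kl.round(4))  # token 2 carries by far the largest KL
```

In on-policy self-distillation, a KL term like this (summed over response tokens) would be minimized with respect to the Generator's parameters, concentrating gradient signal on exactly the tokens the Reviser would change.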