Embarrassingly Simple Self-Distillation Improves Code Generation

arXiv cs.CL / 4/3/2026


Key Points

  • The paper proposes “simple self-distillation (SSD),” where an LLM generates code samples under specific decoding settings and then undergoes standard supervised fine-tuning on those self-generated outputs, without a separate teacher model, verifier, or reinforcement learning.
  • SSD substantially improves code-generation performance for Qwen3-30B-Instruct, raising pass@1 from 42.4% to 55.3% on LiveCodeBench v6, with the largest gains on harder problems.
  • The method generalizes across multiple Qwen and Llama model sizes (4B, 8B, 30B) and across both instruct and “thinking” variants, suggesting the approach is broadly applicable.
  • The authors explain the gains by analyzing a “precision-exploration conflict” in decoding and argue that SSD reshapes token distributions in a context-dependent way—reducing distractor tails when precision matters while retaining useful diversity when exploration helps.
  • Overall, SSD is presented as a complementary post-training technique for improving LLM code generation using only the model’s own raw outputs.
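The temperature and truncation settings the key points refer to can be illustrated with a toy next-token distribution. This is a minimal sketch of standard temperature scaling plus nucleus (top-p) truncation, not code from the paper; the logit values and the `top_p_truncate` helper are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given sampling temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_truncate(probs, top_p=0.95):
    """Keep the smallest set of highest-probability tokens whose total mass
    reaches top_p, then renormalize over the survivors."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy distribution: one strong candidate followed by a "distractor tail".
logits = [5.0, 2.0, 1.0, 0.5, 0.2]
probs = softmax(logits, temperature=1.0)
truncated = top_p_truncate(probs, top_p=0.95)
# Truncation drops the low-probability tail tokens and renormalizes the rest,
# which is the "suppressing distractor tails" effect described above.
```

With these numbers, only the top two tokens survive truncation; raising the temperature flattens the distribution and lets more of the tail back in, which is where the precision-exploration trade-off shows up.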

Abstract

Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.