How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
arXiv cs.CL / 4/17/2026
Key Points
- Synthetic data generated by a stronger “teacher” model for supervised fine-tuning (SFT) can hurt reasoning models, especially when the teacher and student output distributions differ.
- The paper finds that major stylistic divergence between teacher-generated data and student token distributions is a key reason SFT often fails for newer reasoning models like Qwen3-8B.
- It proposes TESSY (Teacher-Student Cooperation Data Synthesis), which alternates between the teacher and the student during generation, routing style tokens and non-style (content) tokens to different models so that the synthesized traces better match the student's stylistic distribution.
- Experiments on code generation show that fine-tuning Qwen3-8B with plain teacher data decreases performance, while using TESSY improves LiveCodeBench-Pro by 11.25% and OJBench by 6.68%.
- The results suggest that controlling style/distribution alignment during synthetic-data generation is critical for reliably transferring reasoning capabilities via SFT.
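The summary does not spell out how TESSY decides which model emits each token, so the following is only a minimal sketch of the alternating idea. It assumes a simple lexical router (`is_style_token`, a hypothetical stand-in) that routes discourse/style tokens to the student while the teacher supplies content tokens; the names `teacher_next` and `student_next` are toy stand-ins for real LM decoding, not the paper's API.

```python
# Hypothetical sketch of teacher-student alternating data synthesis.
# The routing rule below is an illustrative assumption, not the paper's method:
# we treat a small set of discourse/connective words as "style" tokens that the
# student model generates, while the teacher generates all other tokens.

STYLE_TOKENS = {"so", "thus", "okay", "wait", "hmm", "therefore"}

def is_style_token(token: str) -> bool:
    """Stand-in router: decide whether a token is a 'style' token."""
    return token.lower().strip(".,") in STYLE_TOKENS

def synthesize(teacher_next, student_next, prompt, max_tokens=50):
    """Generate one trace, deferring style tokens to the student.

    teacher_next / student_next: callables mapping the token list so far
    to a proposed next token (or None to stop) -- toy LM stand-ins.
    """
    tokens = list(prompt)
    for _ in range(max_tokens):
        proposal = teacher_next(tokens)       # teacher proposes the next token
        if proposal is None:
            break
        if is_style_token(proposal):          # style slot: let the student speak
            proposal = student_next(tokens) or proposal
        tokens.append(proposal)
    return tokens
```

With a toy teacher that emits `["The", "answer", "is", "thus", "42"]` and a student whose preferred connective is `"so"`, the synthesized trace keeps the teacher's content but the student's style: `["The", "answer", "is", "so", "42"]`. The intent, per the paper's framing, is that SFT on such traces avoids the stylistic distribution shift that makes plain teacher data harmful.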