How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

arXiv cs.CL / 4/17/2026


Key Points

  • Synthetic data generated by a stronger “teacher” model for supervised fine-tuning (SFT) can hurt reasoning models, especially when the teacher and student output distributions differ.
  • The paper finds that major stylistic divergence between teacher-generated data and student token distributions is a key reason SFT often fails for newer reasoning models like Qwen3-8B.
  • It proposes TESSY (Teacher-Student Cooperation Data Synthesis), which interleaves the teacher and student models during generation — the teacher produces the non-style (reasoning) tokens and the student produces the style tokens — so the synthesized data better matches the student's stylistic distribution.
  • Experiments on code generation show that fine-tuning Qwen3-8B on plain teacher data decreases performance (by 3.25% on LiveCodeBench-Pro and 10.02% on OJBench), while TESSY yields improvements of 11.25% and 6.68% on the same benchmarks.
  • The results suggest that controlling style/distribution alignment during synthetic-data generation is critical for reliably transferring reasoning capabilities via SFT.

Abstract

A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher-generated data and the student's output distribution as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the student's distribution. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.
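The interleaved generation described above can be sketched as a toy decoding loop. This is a minimal illustration, not the paper's implementation: the `is_style_token` rule, the stand-in `teacher_next`/`student_next` functions, and the fixed token script are all hypothetical placeholders for real model calls and the paper's actual style/non-style classification.

```python
# Hypothetical sketch of teacher-student interleaved decoding.
# Real teacher/student calls and the style-token criterion are
# assumptions; the paper's exact mechanism may differ.

def is_style_token(token: str) -> bool:
    # Toy rule: treat discourse/connective words as "style" tokens.
    return token in {"so", "therefore", "okay", "hmm"}

def teacher_next(prefix: list[str]) -> str:
    # Stand-in for a strong teacher proposing the next token
    # from a fixed reasoning script (for demonstration only).
    script = ["so", "compute", "the", "sum", "therefore", "return", "result"]
    return script[len(prefix)] if len(prefix) < len(script) else "<eos>"

def student_next(prefix: list[str]) -> str:
    # Stand-in for the student rewriting style tokens in its own voice.
    return "okay"

def tessy_decode(max_len: int = 16) -> list[str]:
    """Interleave teacher and student: the teacher proposes each token;
    if it is classified as a style token, the student generates that
    position instead, so the final sequence keeps the teacher's
    reasoning content but the student's stylistic habits."""
    out: list[str] = []
    while len(out) < max_len:
        proposal = teacher_next(out)
        if proposal == "<eos>":
            break
        out.append(student_next(out) if is_style_token(proposal) else proposal)
    return out

print(tessy_decode())
# → ['okay', 'compute', 'the', 'sum', 'okay', 'return', 'result']
```

The key design point the sketch captures is that every token in the synthetic sequence is drawn either from the teacher (content) or from the student's own distribution (style), so SFT on such data never forces the student to imitate stylistic tokens far outside its distribution.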