Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective

arXiv cs.LG / 4/13/2026


Key Points

  • The paper revisits the previously reported “capacity gap” problem in chain-of-thought (CoT) distillation, focusing on how capability mismatch between teacher and student affects distillation outcomes in practice.
  • It finds that CoT distillation can frequently degrade performance relative to the student’s pre-distillation baseline, and that this degradation is often hidden when studies only report post-distillation comparisons.
  • The authors therefore propose a more realistic evaluation protocol, motivated by the regression against the pre-distillation baseline that post-distillation-only comparisons obscure.
  • Under that protocol, they find that capacity-gap effects do not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance.
  • The work provides practical guidance for selecting teacher–student pairs for more reliable CoT distillation results.

Abstract

Chain-of-thought (CoT) distillation transfers reasoning behaviors from a strong teacher to a smaller student, but prior work reports a capacity gap: distillation may fail when the teacher-student capability mismatch is large. We revisit the capacity gap from a practical perspective by re-examining commonly used experimental settings. Notably, we find that CoT distillation often degrades performance compared to the student's pre-distillation baseline, an issue obscured when only post-distillation comparisons are reported. We therefore propose a more realistic evaluation protocol and find that the impact of capacity gap effects does not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance. Our results offer practical guidance for selecting teacher-student pairs in CoT distillation.
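
The abstract's central point, that post-distillation comparisons alone can hide regression against the student's own starting point, can be illustrated with a small sketch. This is not the authors' protocol, which the abstract does not specify. The names `DistillationResult`, `compare_to_baseline`, and `best_teacher` are hypothetical, and every accuracy value is a made-up placeholder, not a result from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DistillationResult:
    """One teacher-student run (hypothetical record, not from the paper)."""
    teacher: str
    teacher_acc: float        # teacher accuracy on the task
    student_acc_after: float  # student accuracy after CoT distillation


def compare_to_baseline(student_baseline_acc: float,
                        results: list[DistillationResult]) -> list[dict]:
    """Report each distilled student against the pre-distillation baseline."""
    rows = []
    for r in results:
        delta = r.student_acc_after - student_baseline_acc
        rows.append({
            "teacher": r.teacher,
            "teacher_acc": r.teacher_acc,
            "student_after": r.student_acc_after,
            "delta_vs_baseline": delta,
            "regressed": delta < 0,  # distillation made the student worse
        })
    return rows


def best_teacher(student_baseline_acc: float,
                 results: list[DistillationResult]) -> Optional[str]:
    """Pick the teacher with the best distilled student, or None if no
    distilled student improves on the undistilled baseline."""
    improving = [row for row in compare_to_baseline(student_baseline_acc, results)
                 if row["delta_vs_baseline"] > 0]
    if not improving:
        return None
    return max(improving, key=lambda row: row["student_after"])["teacher"]


if __name__ == "__main__":
    # Placeholder numbers chosen only to illustrate the comparison.
    runs = [
        DistillationResult("large_teacher", teacher_acc=0.90, student_acc_after=0.48),
        DistillationResult("mid_teacher", teacher_acc=0.75, student_acc_after=0.55),
    ]
    for row in compare_to_baseline(student_baseline_acc=0.52, results=runs):
        print(row)
    print("selected:", best_teacher(0.52, runs))
```

Comparing only the two distilled students would rank the runs without revealing that one of them falls below the undistilled student. Including the baseline as an explicit reference makes that regression visible, and it allows "do not distill" to be a valid outcome of teacher selection.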