Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

arXiv cs.AI / 4/10/2026


Key Points

  • The paper challenges the common post-training claim that supervised fine-tuning (SFT) memorizes while reinforcement learning (RL) generalizes, showing that reasoning SFT can generalize across domains but only under certain conditions.
  • It finds cross-domain generalization may exhibit a “dip-and-recovery” pattern during training, meaning short training checkpoints can falsely suggest poor generalization.
  • Optimization dynamics, training-data quality/structure, and the base model’s capability jointly determine whether long-chain-of-thought (CoT) reasoning SFT transfers procedures effectively.
  • Verified long-CoT traces improve cross-domain performance, while low-quality solutions can harm generalization broadly.
  • The study observes an asymmetric tradeoff: reasoning quality improves while safety can degrade, reframing the evaluation of reasoning SFT generalization as a "when and at what cost" question.
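The "dip-and-recovery" finding implies that generalization should be judged over a full trajectory of checkpoints rather than a single early one. A minimal sketch of such a check is below; the helper name `has_dip_and_recovery` and the scores are illustrative assumptions, not the paper's actual evaluation code or data.

```python
# Hypothetical sketch: detecting a "dip-and-recovery" pattern in
# cross-domain scores logged at successive SFT checkpoints.
# Function name and scores are illustrative, not from the paper.

def has_dip_and_recovery(scores, baseline=None):
    """Return True if scores first fall below the starting value and
    later recover above it, i.e. early checkpoints would have
    underestimated generalization."""
    if baseline is None:
        baseline = scores[0]
    dipped = False
    for s in scores[1:]:
        if s < baseline:
            dipped = True          # cross-domain performance degraded
        elif dipped and s > baseline:
            return True            # degraded, then recovered past baseline
    return False

# Illustrative cross-domain accuracy at increasing training steps:
checkpoint_scores = [0.42, 0.37, 0.35, 0.40, 0.47, 0.51]
print(has_dip_and_recovery(checkpoint_scores))  # True: dip, then recovery
```

Under this sketch, stopping evaluation at the second or third checkpoint would report only the dip, which is exactly the under-optimization artifact the paper cautions against.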

Abstract

A prevailing narrative in LLM post-training holds that supervised fine-tuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.