Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability
arXiv cs.AI / 4/10/2026
Key Points
- The paper challenges the common post-training claim that supervised fine-tuning (SFT) memorizes while reinforcement learning (RL) generalizes, showing that reasoning SFT can generalize across domains but only under certain conditions.
- It finds that cross-domain generalization can exhibit a “dip-and-recovery” pattern during training, so evaluating only early checkpoints can falsely suggest poor generalization (see the checkpoint-tracking sketch after this list).
- Optimization dynamics, training-data quality and structure, and the base model’s capability jointly determine whether long chain-of-thought (CoT) reasoning SFT transfers reasoning procedures effectively.
- Verified long-CoT traces improve cross-domain performance, while low-quality solutions can harm generalization broadly (a trace-filtering sketch also follows this list).
- The study observes asymmetric tradeoffs: reasoning quality improves while safety can degrade, reframing the question from whether reasoning SFT generalizes to when it does and at what cost.
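To make the dip-and-recovery caveat concrete, here is a minimal sketch of tracking out-of-domain accuracy across SFT checkpoints rather than judging a single early one. The `checkpoints` structure, the `exact_match` verifier, and the helper names are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Minimal sketch: score every SFT checkpoint on an out-of-domain eval set,
# so a transient accuracy dip is not mistaken for a failure to generalize.
# All helper names here are hypothetical, for illustration only.

def exact_match(pred: str, gold: str) -> bool:
    """Hypothetical verifier: final-answer string match."""
    return pred.strip() == gold.strip()

def eval_checkpoint(generate, eval_set) -> float:
    """Cross-domain accuracy of one checkpoint on a held-out domain.
    `generate` maps a question string to a model answer string."""
    correct = sum(exact_match(generate(q), a) for q, a in eval_set)
    return correct / len(eval_set)

def track_generalization(checkpoints, eval_set):
    """Accuracy curve over training steps. A trough below the first
    checkpoint followed by recovery means early checkpoints alone
    would understate generalization."""
    curve = [(step, eval_checkpoint(gen, eval_set)) for step, gen in checkpoints]
    trough = min(curve, key=lambda point: point[1])
    dipped_and_recovered = (
        trough[1] < curve[0][1] and curve[-1][1] > trough[1]
    )
    return curve, dipped_and_recovered
```

With `checkpoints` as a list of `(training_step, generate_fn)` pairs, the returned curve makes the dip visible and the boolean flags the pattern the paper warns about.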
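Likewise, the verified-trace finding suggests filtering the SFT corpus by answer correctness before training. The sketch below assumes a toy trace format with an `ANSWER:` marker; the marker, helpers, and data layout are hypothetical stand-ins for whatever verifier the paper actually uses.

```python
# Minimal sketch of the "verified traces" idea from the key points:
# keep a long-CoT training example only if its final answer checks out
# against the reference. Trace format and marker are assumptions.

def final_answer(trace: str) -> str | None:
    """Pull the final answer from a CoT trace; assumes an 'ANSWER:' marker."""
    _, sep, tail = trace.rpartition("ANSWER:")
    return tail.strip() if sep else None

def filter_verified(examples):
    """Keep only (question, trace, gold) triples whose trace ends in the
    gold answer; unverifiable or wrong traces are dropped rather than
    risking the broad harm low-quality solutions can cause."""
    kept = []
    for question, trace, gold in examples:
        answer = final_answer(trace)
        if answer is not None and answer == gold.strip():
            kept.append((question, trace, gold))
    return kept
```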