DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning
arXiv cs.CL / March 13, 2026
Key Points
- DeReason introduces a difficulty-aware data decoupling strategy that uses LLM-based difficulty scoring to split training data into reasoning-intensive and non-reasoning-intensive subsets, tailoring SFT and RL to each (a minimal sketch of this split follows the list).
- The paper finds that applying RL directly to base models is sample-inefficient on general STEM tasks and is often outperformed by SFT trained on even moderate-quality responses, but that sequential SFT followed by RL can yield additional gains.
- By assigning broad, non-reasoning-intensive problems to SFT to build foundational knowledge and reserving difficult problems for RL, DeReason achieves better performance than SFT-only, RL-only, or randomly split baselines.
- Extensive experiments on general STEM and mathematical benchmarks demonstrate the effectiveness and generality of this decoupled curriculum as a practical post-training recipe for enhancing general reasoning in LLMs.
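
The decoupling step can be pictured with a short sketch. Everything below is an assumption for illustration: the 1-5 difficulty scale, the routing threshold, the judge prompt wording, and the `llm_call` client are hypothetical stand-ins, not the paper's exact scoring setup.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    question: str
    reference_answer: str


# Assumed 1-5 difficulty scale; the threshold separating the two
# subsets is a hypothetical choice for illustration.
DIFFICULTY_THRESHOLD = 3

SCORING_PROMPT = (
    "Rate how much multi-step reasoning the following problem requires, "
    "from 1 (recall only) to 5 (long derivation). "
    "Reply with a single integer.\n\nProblem: {question}"
)


def score_difficulty(problem: Problem, llm_call) -> int:
    """Ask a judge LLM for a difficulty score.

    `llm_call(prompt) -> str` is a placeholder for whatever inference
    client is available; it is not an API from the paper.
    """
    reply = llm_call(SCORING_PROMPT.format(question=problem.question))
    try:
        return max(1, min(5, int(reply.strip())))
    except ValueError:
        # On an unparseable reply, fall back to the boundary score.
        return DIFFICULTY_THRESHOLD


def decouple(problems: list[Problem], llm_call) -> tuple[list[Problem], list[Problem]]:
    """Split the pool into an SFT subset (broad, non-reasoning-intensive)
    and an RL subset (difficult, reasoning-intensive)."""
    sft_subset, rl_subset = [], []
    for p in problems:
        if score_difficulty(p, llm_call) > DIFFICULTY_THRESHOLD:
            rl_subset.append(p)
        else:
            sft_subset.append(p)
    return sft_subset, rl_subset
```

Routing at a single threshold mirrors the curriculum the key points describe: the SFT subset stays broad to build foundational knowledge, while RL compute is concentrated on the hard, reasoning-intensive problems.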