DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning
arXiv cs.CL / 3/13/2026
Key Points
- DeReason introduces a difficulty-aware data-decoupling strategy that uses LLM-based difficulty scoring to split training data into reasoning-intensive and non-reasoning-intensive subsets, tailoring each subset to SFT or RL (see the sketch after this list).
- The paper finds that applying RL directly to base models is sample-inefficient on general STEM tasks and is often outperformed by SFT on moderate-quality responses; sequential SFT followed by RL, however, can yield additional gains.
- By assigning broad, non-reasoning-intensive problems to SFT to build foundational knowledge and reserving difficult problems for RL, DeReason achieves better performance than SFT-only, RL-only, or randomly split baselines.
- Extensive experiments on general STEM and mathematical benchmarks demonstrate the effectiveness and generality of this decoupled curriculum as a practical post-training recipe for enhancing general reasoning in LLMs.
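To make the decoupling step in the first key point concrete, here is a minimal Python sketch of difficulty-aware data routing, assuming an LLM judge that scores how reasoning-intensive each problem is. The judge prompt, the `score_difficulty` helper, and the `REASONING_THRESHOLD` cutoff are hypothetical illustrations under that assumption, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Problem:
    prompt: str
    reference_answer: str  # kept for later RL reward verification

# Hypothetical judge prompt; the paper's actual scoring prompt is not shown here.
JUDGE_PROMPT = (
    "Rate how much multi-step reasoning this problem requires, from 1 "
    "(pure recall) to 10 (long derivation). Reply with one integer.\n\n"
    "Problem: {prompt}"
)

REASONING_THRESHOLD = 6  # hypothetical cutoff between the two subsets

def score_difficulty(problem: Problem, judge: Callable[[str], str]) -> int:
    """Ask an LLM judge for a 1-10 reasoning-intensity score."""
    reply = judge(JUDGE_PROMPT.format(prompt=problem.prompt))
    try:
        return int(reply.strip())
    except ValueError:
        # On an unparsable reply, fall back to the boundary score.
        return REASONING_THRESHOLD

def decouple(problems: List[Problem],
             judge: Callable[[str], str]) -> Tuple[List[Problem], List[Problem]]:
    """Split a corpus into an SFT subset (broad, non-reasoning-intensive)
    and an RL subset (difficult, reasoning-intensive)."""
    sft_subset: List[Problem] = []
    rl_subset: List[Problem] = []
    for p in problems:
        if score_difficulty(p, judge) >= REASONING_THRESHOLD:
            rl_subset.append(p)   # hard problems: reserve for the RL stage
        else:
            sft_subset.append(p)  # broad coverage: learn via SFT
    return sft_subset, rl_subset
```

Here `judge` is any prompt-to-completion callable, so the same routing logic works with a local model or a hosted API; the resulting `sft_subset` builds foundational knowledge through supervised fine-tuning, while the `rl_subset` is reserved for the subsequent RL stage.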