
Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

arXiv cs.LG / 3/20/2026


Key Points

  • The paper introduces autocurriculum, a training paradigm in which the model uses its own performance signals to select which problems to focus on, enabling adaptive data selection without assumptions about the distribution or difficulty of prompts.
  • In supervised fine-tuning, autocurriculum dramatically reduces the required reasoning demonstrations by concentrating teacher supervision on prompts where the model currently struggles, yielding exponential gains over non-adaptive fine-tuning.
  • In reinforcement learning fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing it to a burn-in cost that is nearly independent of the target accuracy.
  • The improvements arise from combining ideas from boosting and learning from counterexamples, providing algorithmic efficiency gains without new assumptions about data distribution.
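The adaptive data-selection idea in the bullets above can be sketched schematically. Everything here is a hypothetical illustration, not the paper's algorithm: the "model" is a toy solved-set, and the function name `autocurriculum_sft` and its parameters are invented for the sketch. The point it shows is that each round of teacher supervision is spent only on prompts the current model fails.

```python
def autocurriculum_sft(prompts, batch_size=4, max_rounds=10):
    """Toy autocurriculum loop (illustrative only).

    Each round: self-evaluate, then spend teacher demonstrations
    only on prompts the current model still fails.
    """
    solved = set()   # toy stand-in for the model's current ability
    demos_used = 0   # teacher demonstrations are the costly resource
    for _ in range(max_rounds):
        # Adaptive selection: the model's own performance picks the data.
        failing = [p for p in prompts if p not in solved]
        if not failing:
            break
        batch = failing[:batch_size]   # concentrate supervision here
        demos_used += len(batch)
        solved.update(batch)           # stands in for fine-tuning on the demos
    return solved, demos_used

prompts = list(range(10))
solved, demos = autocurriculum_sft(prompts)
# covers all 10 prompts using exactly 10 demonstrations: none wasted on
# prompts the model already handles
```

In this caricature, supervision cost tracks the number of currently-failing prompts rather than the full prompt set, which is the mechanism the paper credits for its sample-efficiency gains in SFT.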

Abstract

Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumption on the distribution or difficulty of prompts.