Avoiding Overthinking and Underthinking: Curriculum-Aware Budget Scheduling for LLMs

arXiv cs.CL / 4/23/2026


Key Points

  • The paper argues that current LLM test-time reasoning methods waste tokens by using fixed or uniformly sampled budgets, causing overthinking on easy problems and underthinking on hard ones.
  • It proposes Budget-Adaptive Curriculum Reasoning (BACR), which jointly improves reasoning quality and token efficiency via (1) a budget-conditioned unified policy, (2) a curriculum-aware budget scheduler driven by real-time learning progress, and (3) truncation-aware dense rewards with process-level verification.
  • The work also introduces Budget-Conditioned Advantage Estimation (BCAE), which reduces gradient variance by conditioning the advantage baseline on the sampled budget.
  • Experiments on mathematical reasoning benchmarks (MATH, GSM8K, AIME, and Minerva Math) show consistent improvements across token budgets, including up to an 8.3% accuracy gain under tight budgets, alongside a 34% reduction in average token usage versus unconstrained reasoning.
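The variance-reduction idea behind Budget-Conditioned Advantage Estimation can be illustrated with a minimal sketch. This is not the paper's implementation; the function name and the `(budget, reward)` pair representation are assumptions. The key idea stated in the summary is that each rollout's baseline is the mean reward of rollouts sharing the same sampled budget, rather than one global baseline:

```python
from collections import defaultdict


def budget_conditioned_advantages(samples):
    """Hypothetical sketch of budget-conditioned advantage estimation.

    `samples` is a list of (budget, reward) pairs. Instead of
    subtracting one global mean reward, each sample's baseline is the
    mean reward of samples drawn under the same token budget, so
    rollouts are only compared against equally-constrained rollouts.
    """
    by_budget = defaultdict(list)
    for budget, reward in samples:
        by_budget[budget].append(reward)
    # Per-budget baseline: mean reward within each budget group.
    baselines = {b: sum(rs) / len(rs) for b, rs in by_budget.items()}
    return [reward - baselines[budget] for budget, reward in samples]


# Example: under a 512-token budget rewards vary, while the 2048-token
# group is uniformly correct and contributes zero advantage.
advs = budget_conditioned_advantages(
    [(512, 1.0), (512, 0.0), (2048, 1.0), (2048, 1.0)]
)
# → [0.5, -0.5, 0.0, 0.0]
```

Grouping the baseline by budget removes the reward variation that is explained purely by how much compute a rollout was allowed, which is the source of the "more stable policy gradients" claimed in the abstract.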

Abstract

Scaling test-time compute via extended reasoning has become a key paradigm for improving the capabilities of large language models (LLMs). However, existing approaches optimize reasoning under fixed or uniformly sampled token budgets, ignoring the fundamental mismatch between problem difficulty and allocated compute. This leads to overthinking on easy problems and underthinking on hard ones, resulting in suboptimal token efficiency across diverse reasoning scenarios. In this paper, we propose Budget-Adaptive Curriculum Reasoning (BACR), a unified framework that jointly optimizes reasoning quality and token efficiency through three synergistic components: (1) a \emph{budget-conditioned unified policy} that embeds the token budget as a continuous conditioning signal, eliminating the need for decoupled thinking and summarization strategies; (2) a \emph{curriculum-aware budget scheduler} that adaptively shifts the training budget distribution from easy to hard problems based on real-time learning progress; and (3) a \emph{truncation-aware dense reward} mechanism that provides fine-grained credit assignment at intermediate reasoning steps via process-level verification. We further introduce \emph{Budget-Conditioned Advantage Estimation} (BCAE), a novel variance reduction technique that conditions the advantage baseline on the sampled budget, yielding more stable policy gradients. Experiments on mathematical reasoning benchmarks (MATH, GSM8K, AIME, and Minerva Math) demonstrate that BACR consistently outperforms strong baselines across all token budgets, achieving up to 8.3\% accuracy improvement under tight budgets while reducing average token consumption by 34\% compared to unconstrained reasoning.
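The curriculum-aware budget scheduler described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the paper's method: the class name, the accuracy-over-a-window proxy for "learning progress", the budget range, and the Gaussian sampling spread are all assumptions.

```python
import random


class CurriculumBudgetScheduler:
    """Hypothetical sketch: shift the sampled token-budget distribution
    from small (easy-leaning) toward large (hard-leaning) budgets as a
    simple learning-progress signal rises."""

    def __init__(self, min_budget=256, max_budget=4096, window=100):
        self.min_budget = min_budget
        self.max_budget = max_budget
        self.window = window
        self.outcomes = []  # recent 0/1 correctness signals

    def update(self, correct):
        """Record whether the latest rollout solved its problem."""
        self.outcomes.append(1.0 if correct else 0.0)
        self.outcomes = self.outcomes[-self.window:]

    def progress(self):
        """Learning-progress proxy: recent accuracy in [0, 1]."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def sample_budget(self):
        """Sample a budget whose mean interpolates from min_budget
        (no progress) to max_budget (full progress), clamped to range."""
        p = self.progress()
        mean = self.min_budget + p * (self.max_budget - self.min_budget)
        raw = random.gauss(mean, 256)
        return int(max(self.min_budget, min(self.max_budget, raw)))
```

Early in training the scheduler concentrates budgets near the low end, matching the paper's easy-to-hard framing; as measured progress improves, the distribution drifts toward the larger budgets that harder problems need.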
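The truncation-aware dense reward can likewise be sketched under stated assumptions. The function name, the per-step weight, the final-answer bonus, and the convention of representing the process verifier's output as a list of 0/1 step verdicts are all hypothetical; the sketch only captures the abstract's two stated properties: per-step credit from process-level verification, and special handling of traces cut off by the budget.

```python
def truncation_aware_dense_reward(step_verdicts, truncated,
                                  step_weight=0.1, final_bonus=1.0):
    """Hypothetical sketch of a truncation-aware dense reward.

    `step_verdicts` holds a process verifier's 0/1 judgment for each
    intermediate reasoning step. Each verified step earns partial
    credit, so truncated traces still receive fine-grained signal for
    the work they completed; the final-answer bonus is withheld when
    the trace was cut off by the token budget.
    """
    # Dense, process-level credit for every verified intermediate step.
    reward = step_weight * sum(step_verdicts)
    # Outcome bonus only for complete traces ending in a verified step.
    if not truncated and step_verdicts and step_verdicts[-1] == 1:
        reward += final_bonus
    return reward
```

The design choice this illustrates: without the dense term, a budget-truncated rollout would receive zero reward regardless of how much correct reasoning it produced, giving the policy no gradient toward partially correct behavior under tight budgets.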