Not All Turns Are Equally Hard: Adaptive Thinking Budgets For Efficient Multi-Turn Reasoning

arXiv cs.LG / 4/8/2026


Key Points

  • The paper argues that LLM reasoning gains are plateauing, so improving inference-time compute efficiency is essential to reduce unnecessary long “thinking traces,” especially in multi-turn settings where turns depend on each other.
  • It formulates multi-turn reasoning as a sequential compute allocation problem using a multi-objective Markov Decision Process, then introduces TAB (Turn-Adaptive Budgets) to adaptively allocate token budgets per turn under a global per-problem token constraint.
  • TAB is trained with Group Relative Policy Optimization (GRPO) to maximize accuracy while learning to spend fewer tokens on easier turns and reserve more tokens for harder, critical reasoning steps.
  • Experiments on mathematical reasoning benchmarks show TAB achieves a better accuracy–tokens tradeoff, saving up to 35% tokens while maintaining accuracy versus static and off-the-shelf budget baselines.
  • The paper also proposes TAB All-SubQ, which leverages an available plan of sub-questions to allocate budgets across past and future sub-questions, yielding up to 40% token savings over baselines.
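The core idea behind TAB — spend less on easy turns so tokens remain for hard ones, all under a global per-problem cap — can be sketched with a toy heuristic. The paper learns this allocation with RL; the `estimate_difficulty` proxy and the `min_budget` floor below are purely illustrative assumptions, not the paper's method:

```python
# Hypothetical sketch of per-turn budget allocation under a global
# token constraint. TAB learns this mapping with GRPO; the difficulty
# proxy here (sub-question length) is a stand-in assumption.

def estimate_difficulty(history: list[str], sub_question: str) -> float:
    """Toy proxy: longer sub-questions count as harder (0..1)."""
    return min(len(sub_question) / 200.0, 1.0)

def allocate_budgets(sub_questions: list[str], global_budget: int,
                     min_budget: int = 64) -> list[int]:
    """Split a per-problem token budget across turns, giving harder
    turns a proportionally larger share above a small floor."""
    history: list[str] = []
    scores = []
    for q in sub_questions:
        scores.append(estimate_difficulty(history, q))
        history.append(q)
    total = sum(scores) or 1.0
    spendable = global_budget - min_budget * len(sub_questions)
    return [min_budget + int(spendable * s / total) for s in scores]

budgets = allocate_budgets(
    ["Compute 2+2.",
     "Prove the general case for arbitrary n by induction on the recurrence."],
    global_budget=1000,
)
print(budgets)  # easy turn gets a small budget, hard turn a large one
```

Integer truncation guarantees the allocations never exceed the global budget, mirroring the hard per-problem constraint the paper enforces.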

Abstract

As LLM reasoning performance plateaus, improving inference-time compute efficiency is crucial to mitigate overthinking and long thinking traces even for simple queries. Prior approaches including length regularization, adaptive routing, and difficulty-based budget allocation primarily focus on single-turn settings and fail to address the sequential dependencies inherent in multi-turn reasoning. In this work, we formulate multi-turn reasoning as a sequential compute allocation problem and model it as a multi-objective Markov Decision Process. We propose TAB: Turn-Adaptive Budgets, a budget allocation policy trained via Group Relative Policy Optimization (GRPO) that learns to maximize task accuracy while respecting global per-problem token constraints. Consequently, TAB takes as input the conversation history and learns to adaptively allocate smaller budgets to easier turns and save an appropriate number of tokens for the crucial harder reasoning steps. Our experiments on mathematical reasoning benchmarks demonstrate that TAB achieves a superior accuracy–tokens tradeoff, saving up to 35% tokens while maintaining accuracy over static and off-the-shelf LLM budget baselines. Further, for systems where a plan of all sub-questions is available a priori, we propose TAB All-SubQ, a budget allocation policy that budgets tokens based on the conversation history and all past and future sub-questions, saving up to 40% tokens over baselines.
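The GRPO training the abstract mentions rests on a group-relative advantage: sample a group of rollouts per problem, score each, and normalize rewards within the group so no learned value function is needed. The sketch below shows that normalization; the specific multi-objective reward (accuracy minus a token-cost penalty) is our assumption of how accuracy and budget pressure could be combined, not the paper's exact formulation:

```python
# Group-relative advantage at the heart of GRPO: normalize each
# rollout's reward against its group's mean and standard deviation.
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each rollout: (r - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

def reward(correct: bool, tokens_used: int, budget: int,
           cost_weight: float = 0.5) -> float:
    """Hypothetical multi-objective reward: task accuracy minus a
    penalty proportional to the share of the global budget consumed."""
    return float(correct) - cost_weight * tokens_used / budget

# Three rollouts of one problem: correct-and-cheap beats
# correct-but-verbose, which beats incorrect.
group = [reward(True, 400, 1000),   # 0.80
         reward(True, 900, 1000),   # 0.55
         reward(False, 300, 1000)]  # -0.15
print(grpo_advantages(group))
```

Because advantages are centered within the group, a correct answer that used fewer tokens is pushed up relative to an equally correct but verbose one, which is exactly the pressure that teaches a policy to spend less on easy turns.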