Strategic Scaling of Test-Time Compute: A Bandit Learning Approach

arXiv stat.ML / 4/24/2026


Key Points

  • The paper argues that scaling test-time compute for LLMs is more efficient when compute allocation adapts to each query’s difficulty rather than using uniform compute for all inputs.
  • It formulates adaptive test-time compute allocation as a bandit learning problem, using on-the-fly estimates of query difficulty to decide how much computation to spend.
  • The proposed approach allocates more compute to harder queries while limiting spending on easier ones, improving overall compute efficiency without sacrificing accuracy.
  • For difficult queries, the method further learns to prioritize instances that are solvable, reducing waste on unsolvable cases.
  • The authors provide theoretical guarantees of better compute efficiency than uniform allocation and validate performance gains on math and code benchmarks, including up to ~11% absolute improvements on MATH-500, AIME25, and LiveCodeBench.

Abstract

Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models. However, existing methods typically allocate compute uniformly across all queries, overlooking variation in query difficulty. To address this inefficiency, we formulate test-time compute allocation as a novel bandit learning problem and propose adaptive algorithms that estimate query difficulty on the fly and allocate compute accordingly. Compared to uniform allocation, our algorithms devote more compute to challenging queries while maintaining accuracy on easier ones. Among challenging queries, they further learn to prioritize solvable instances, reducing excessive computation on unsolvable ones. We theoretically prove that our algorithms achieve better compute efficiency than uniform allocation and empirically validate their effectiveness on math and code benchmarks, achieving performance improvements of up to 11.10% (15.04% relative) on MATH-500, 10.82% (14.44% relative) on AIME25, and 11.23% (15.29% relative) on LiveCodeBench.
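The core idea, spending more compute where a query looks hard, can be illustrated with a simple sequential-sampling rule: draw answers one at a time and stop early once enough samples agree, so easy (consistent) queries use few samples while hard (noisy) ones consume more of the budget. This is a minimal stand-in sketch, not the paper's actual bandit algorithm; the `solve_once` interface, the agreement threshold, and the budget below are illustrative assumptions:

```python
import random

def adaptive_allocate(solve_once, max_samples=16, agree_threshold=3):
    """Sample answers sequentially; stop as soon as `agree_threshold`
    samples agree (the query is estimated to be easy), otherwise keep
    spending compute up to `max_samples` (the query is estimated hard)."""
    counts = {}
    for n in range(1, max_samples + 1):
        ans = solve_once()                     # one "unit" of test-time compute
        counts[ans] = counts.get(ans, 0) + 1
        if counts[ans] >= agree_threshold:
            return ans, n                      # early stop on an easy query
    # Budget exhausted: fall back to a majority vote over all samples.
    return max(counts, key=counts.get), max_samples

# Toy stand-ins for a model: an "easy" query yields a consistent answer,
# a "hard" query yields a noisy one.
random.seed(0)
easy = lambda: "42"
hard = lambda: random.choice(["A", "B", "C", "D"])

ans_easy, n_easy = adaptive_allocate(easy)    # stops after 3 agreeing samples
ans_hard, n_hard = adaptive_allocate(hard)    # needs more samples, up to 16
```

Under uniform allocation both queries would receive `max_samples` calls; here the consistent query terminates after `agree_threshold` calls, which is the efficiency gain the adaptive formulation targets.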