Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization

arXiv cs.LG / 4/17/2026


Key Points

  • The paper addresses how to allocate extra computation at inference time for reasoning-focused LLMs when total compute budgets are limited: deciding which inputs merit extra compute and which can be answered cheaply.
  • It formulates adaptive test-time compute allocation as a constrained optimization problem (maximize expected accuracy under an average compute budget) and solves it using a two-stage Solve-then-Learn approach.
  • In the Solve stage, Lagrangian relaxation breaks the global budget constraint into per-instance subproblems, yielding closed-form oracle actions; a proof that the induced cost is monotone in the dual variable enables exact budget targeting via binary search.
  • In the Learn stage, the method trains a lightweight classifier to predict oracle actions from inexpensive input features, enabling efficient real-time deployment while bounding the learned policy’s regret via imitation error.
  • Experiments on MATH and GSM8K using DeepSeek-V3, GPT-4o-mini, and Qwen2.5-7B show consistent gains over uniform/heuristic baselines, including up to a 12.8% relative accuracy improvement on MATH under matched budgets and over 91% imitation accuracy.
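The Solve stage described above can be illustrated with a small numerical sketch. Everything here is hypothetical and not from the paper: synthetic per-instance accuracy estimates `acc`, a shared per-action cost vector `cost`, and the generic Lagrangian recipe of pricing accuracy against cost, then binary-searching the dual variable so the oracle policy's average cost meets the budget.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n instances, k compute levels (actions).
# acc[i, a]: estimated accuracy of answering instance i with action a.
# cost[a]:   compute cost of action a (e.g. number of samples drawn).
n, k = 1000, 4
cost = np.array([1.0, 2.0, 4.0, 8.0])
base = rng.uniform(0.2, 0.6, size=(n, 1))
gains = rng.uniform(0.0, 0.3, size=(n, k - 1))      # marginal accuracy gains
acc = np.clip(np.concatenate([base, base + np.cumsum(gains, axis=1)], axis=1),
              0.0, 1.0)                             # accuracy grows with compute

def oracle_actions(lam):
    """Per-instance closed-form oracle: maximize acc minus lam * cost."""
    return np.argmax(acc - lam * cost[None, :], axis=1)

def avg_cost(lam):
    return cost[oracle_actions(lam)].mean()

def solve_lambda(budget, lo=0.0, hi=1.0, iters=50):
    """Binary search the dual variable: avg cost is monotone non-increasing in lam."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if avg_cost(mid) > budget:
            lo = mid   # over budget -> raise the price of compute
        else:
            hi = mid
    return hi          # hi keeps avg cost at or under the budget

lam = solve_lambda(budget=3.0)
a_star = oracle_actions(lam)
print(f"lambda={lam:.4f}  avg_cost={cost[a_star].mean():.3f}  "
      f"avg_acc={acc[np.arange(n), a_star].mean():.3f}")
```

Raising the dual variable makes compute "more expensive," so cheaper actions win on more instances and average cost falls; this monotonicity is what makes the binary search valid.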

Abstract

Test-time compute scaling, the practice of spending extra computation during inference via repeated sampling, search, or extended reasoning, has become a powerful lever for improving large language model performance. Yet deploying these techniques under finite inference budgets requires a decision that current systems largely ignore: which inputs deserve more compute, and which can be answered cheaply? We formalize this as a constrained optimization problem (maximize expected accuracy subject to an average compute budget) and solve it with a two-stage Solve-then-Learn pipeline. In the solve stage, Lagrangian relaxation decomposes the global constraint into per-instance sub-problems, each admitting a closed-form oracle action that optimally prices accuracy against cost. We prove that the induced cost is monotone in the dual variable, enabling exact budget targeting via binary search. In the learn stage, a lightweight classifier is trained to predict oracle actions from cheap input features, amortizing the allocation rule for real-time deployment. We establish that the task-level regret of the learned policy is bounded by its imitation error times the worst-case per-instance gap, yielding a clean reduction from constrained inference to supervised classification. Experiments on MATH and GSM8K with three LLMs (DeepSeek-V3, GPT-4o-mini, Qwen2.5-7B) show that our method consistently outperforms uniform and heuristic allocation baselines, achieving up to 12.8% relative accuracy improvement on MATH under matched budget constraints, while closely tracking the Lagrangian oracle upper bound with over 91% imitation accuracy.
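The abstract's reduction from constrained inference to supervised classification rests on one inequality: task-level regret is at most the imitation error times the worst-case per-instance gap. A tiny synthetic check (all numbers invented, not from the paper) makes the bound concrete, standing in a noisy imitator for the trained classifier:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: util[i, a] is the utility of taking action a on
# instance i; a_star is the oracle policy; pi is a learned policy that
# imitates the oracle 90% of the time and otherwise acts randomly.
n, k = 500, 4
util = rng.uniform(0.0, 1.0, size=(n, k))
a_star = util.argmax(axis=1)
pi = np.where(rng.random(n) < 0.9, a_star, rng.integers(0, k, size=n))

idx = np.arange(n)
imit_err = (pi != a_star).mean()                        # classification error
regret = (util[idx, a_star] - util[idx, pi]).mean()     # task-level regret
gap = (util.max(axis=1) - util.min(axis=1)).max()       # worst per-instance gap

# Each mismatch loses at most `gap`, so regret <= imit_err * gap.
print(f"imitation error={imit_err:.3f}  regret={regret:.4f}  "
      f"bound={imit_err * gap:.4f}")
```

This is why driving imitation accuracy above 91% (as reported) keeps the learned policy close to the Lagrangian oracle upper bound: the regret shrinks at least linearly with the classification error.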