Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
arXiv cs.LG / 4/17/2026
Key Points
- The paper addresses how to allocate extra computation at inference time for reasoning-focused LLMs when total compute budgets are limited, deciding which inputs merit more compute versus cheaper answers.
- It formulates adaptive test-time compute allocation as a constrained optimization problem (maximize expected accuracy under an average compute budget) and solves it using a two-stage Solve-then-Learn approach.
- In the Solve stage, Lagrangian relaxation decomposes the global budget constraint into per-instance subproblems, which yields closed-form oracle actions. The paper also proves that cost is monotonic in the dual variable, so the multiplier can be tuned to hit a target budget precisely.
- In the Learn stage, the method trains a lightweight classifier to predict oracle actions from inexpensive input features, enabling efficient real-time deployment while bounding the learned policy’s regret via imitation error.
- Experiments on MATH and GSM8K using DeepSeek-V3, GPT-4o-mini, and Qwen2.5-7B show consistent gains over uniform/heuristic baselines, including up to a 12.8% relative accuracy improvement on MATH under matched budgets and over 91% imitation accuracy.
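
To make the Solve stage concrete, here is a minimal Python sketch of the Lagrangian decomposition summarized above. It is an illustration under stated assumptions, not the paper's implementation: the discrete compute levels, the `acc_table` of estimated per-instance success probabilities, the toy numbers, and the names `oracle_action`, `average_cost`, and `solve_dual` are all hypothetical. For a given multiplier `lam`, each instance independently picks the action that maximizes estimated accuracy minus `lam` times cost. Because average cost does not increase as `lam` grows, a simple bisection can find a multiplier that satisfies the budget.

```python
from typing import List, Sequence


def oracle_action(acc: Sequence[float], cost: Sequence[float], lam: float) -> int:
    """Per-instance subproblem: argmax over actions of acc[a] - lam * cost[a].

    Ties are broken toward the cheaper action (an assumption of this sketch).
    """
    best = 0
    best_score = acc[0] - lam * cost[0]
    for a in range(1, len(cost)):
        score = acc[a] - lam * cost[a]
        if score > best_score or (score == best_score and cost[a] < cost[best]):
            best, best_score = a, score
    return best


def average_cost(acc_table: Sequence[Sequence[float]],
                 cost: Sequence[float], lam: float) -> float:
    """Average compute spent when every instance takes its oracle action."""
    actions = [oracle_action(row, cost, lam) for row in acc_table]
    return sum(cost[a] for a in actions) / len(actions)


def solve_dual(acc_table: Sequence[Sequence[float]], cost: Sequence[float],
               budget: float, iters: int = 60) -> float:
    """Bisection on the multiplier; returns a lam whose average cost <= budget."""
    if budget < min(cost):
        raise ValueError("budget is below the cheapest action's cost")
    if average_cost(acc_table, cost, 0.0) <= budget:
        return 0.0  # the constraint is inactive, no penalty needed
    lo, hi = 0.0, 1.0
    # Grow hi until it is feasible; a large lam pushes every instance
    # toward the cheapest action, so this loop terminates.
    while average_cost(acc_table, cost, hi) > budget:
        hi *= 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if average_cost(acc_table, cost, mid) > budget:
            lo = mid   # still over budget: penalize cost more
        else:
            hi = mid   # feasible: try a smaller penalty
    return hi


# Hypothetical toy data: 3 compute levels, 4 inputs of varying difficulty.
cost: List[float] = [1.0, 4.0, 16.0]
acc_table: List[List[float]] = [
    [0.90, 0.95, 0.96],  # easy input: extra compute barely helps
    [0.30, 0.70, 0.80],  # medium input
    [0.05, 0.10, 0.60],  # hard input: only the largest level helps
    [0.20, 0.25, 0.30],  # input that stays hard at every level
]
lam = solve_dual(acc_table, cost, budget=6.0)
oracle_actions = [oracle_action(row, cost, lam) for row in acc_table]
```

In the Learn stage, labels such as `oracle_actions` would serve as training targets for a lightweight classifier over inexpensive input features. That part is omitted here because the summary does not specify the features or the classifier.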


