Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations

arXiv cs.AI / 4/25/2026

📰 News · Models & Research

Key Points

  • The paper proposes a framework that adaptively allocates test-time compute while simultaneously adjusting how the model generates outputs.
  • It uses a warm-up step to find easy queries and build an initial set of question–response pairs drawn from the test set itself.
  • In the adaptive phase, additional compute is concentrated on unresolved queries, whose generation distributions are reshaped via evolving in-context demonstrations.
  • The evolving demonstrations condition each generation on previously successful responses from semantically related queries, avoiding repeated sampling from a fixed distribution.
  • Experiments on math, coding, and reasoning benchmarks show consistent improvements over baselines while using substantially less inference-time compute.

Abstract

While scaling test-time compute can substantially improve model performance, existing approaches either rely on static compute allocation or sample from fixed generation distributions. In this work, we introduce a test-time compute allocation framework that jointly adapts where computation is spent and how generation is performed. Our method begins with a warm-up phase that identifies easy queries and assembles an initial pool of question-response pairs from the test set itself. An adaptive phase then concentrates further computation on unresolved queries while reshaping their generation distributions through evolving in-context demonstrations -- conditioning each generation on successful responses from semantically related queries rather than resampling from a fixed distribution. Experiments across math, coding, and reasoning benchmarks demonstrate that our approach consistently outperforms existing baselines while consuming substantially less inference-time compute.
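The two-phase procedure described above can be sketched as a simple loop. This is a hypothetical illustration, not the paper's implementation: the `generate`, `verify`, and `similarity` functions are toy stand-ins (the paper does not specify how correctness is checked or how semantic relatedness is measured), and the budget parameters are arbitrary.

```python
def similarity(a: str, b: str) -> float:
    # Toy lexical (Jaccard) overlap standing in for a semantic embedding metric.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def adaptive_allocate(queries, generate, verify,
                      warmup_samples=2, budget=4, n_demos=2):
    """Hypothetical sketch of the two-phase allocation.

    Phase 1 (warm-up): spend a small, uniform budget on every query;
    queries whose responses pass `verify` seed a pool of
    (question, response) demonstration pairs drawn from the test set.

    Phase 2 (adaptive): concentrate the remaining budget on unresolved
    queries, prepending the most similar solved pairs as in-context
    demonstrations so each new generation is conditioned on successful
    responses rather than resampled from a fixed distribution.
    """
    solved = {}  # query -> verified response (the evolving demo pool)

    # Phase 1: uniform warm-up pass to identify easy queries.
    for q in queries:
        for _ in range(warmup_samples):
            r = generate(q, demos=[])
            if verify(q, r):
                solved[q] = r
                break

    # Phase 2: focus further compute on unresolved queries.
    for q in [q for q in queries if q not in solved]:
        for _ in range(budget):
            # Retrieve the most semantically related solved pairs as demos.
            demos = sorted(solved.items(),
                           key=lambda kv: similarity(q, kv[0]),
                           reverse=True)[:n_demos]
            r = generate(q, demos=demos)
            if verify(q, r):
                solved[q] = r  # newly solved queries enrich the pool
                break
    return solved
```

Because newly solved queries are fed back into the pool, the demonstration set evolves over the run, which is what lets later generations draw on answers to related queries resolved earlier.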