Guided Speculative Inference for Efficient Test-Time Alignment of LLMs

arXiv stat.ML / April 28, 2026


Key Points

  • The paper introduces Guided Speculative Inference (GSI), an algorithm for efficient reward-guided decoding in large language models at test time.
  • GSI uses a soft best-of-n strategy combined with a reward model r(x,y) and speculative candidate samples generated by a smaller auxiliary model π_S(y|x).
  • The authors provide provable approximations of the optimal tilted policy (based on exp(β·r(x,y))) and of the expected reward under that optimal policy.
  • Experiments across multiple reasoning and academic benchmarks show that GSI achieves higher accuracy than standard soft best-of-n using the auxiliary model and than reward-guided speculative decoding, and in some settings even outperforms soft best-of-n using the base model.
  • Reported end-to-end latency is reduced by up to 28%, and the authors released code on GitHub.
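The soft best-of-n rule mentioned above can be illustrated with a minimal sketch. The function below is a stand-in for the selection step only: candidate generation and the reward model r(x,y) are assumed to exist elsewhere, and this is not the paper's full GSI algorithm.

```python
import math
import random

def soft_best_of_n(candidates, rewards, beta):
    """Sample one candidate with probability proportional to exp(beta * reward).

    This is the standard soft best-of-n rule: as beta grows large it
    approaches hard best-of-n (pick the argmax-reward candidate), while
    beta = 0 reduces to uniform sampling over the n candidates.
    """
    # Subtract the max reward before exponentiating for numerical stability.
    m = max(rewards)
    weights = [math.exp(beta * (r - m)) for r in rewards]
    # Draw one candidate index from the reward-tilted distribution.
    idx = random.choices(range(len(candidates)), weights=weights, k=1)[0]
    return candidates[idx]
```

In GSI the candidates would come from the small auxiliary model rather than the base model, which is what makes the procedure fast.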

Abstract

We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-n test-time scaling with a reward model r(x,y) and speculative samples from a small auxiliary model \pi_S(y\mid x). We provably approximate both the optimal tilted policy \pi_{\beta,B}(y\mid x) \propto \pi_B(y\mid x)\exp(\beta\,r(x,y)) of soft best-of-n under the base model \pi_B, as well as the expected reward under the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math, MMLU-STEM, GSM8K) and across different model families, our method achieves higher accuracy than standard soft best-of-n with \pi_S and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-n with \pi_B, while reducing end-to-end latency by up to 28\%. The code is available at https://github.com/j-geuter/GSI .
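The abstract states that GSI approximates both the tilted base-model policy \pi_{\beta,B} and its expected reward using candidates drawn from the small model \pi_S. One standard way to correct for sampling from \pi_S instead of \pi_B is self-normalized importance weighting with weights proportional to (\pi_B/\pi_S)\exp(\beta r); the sketch below shows that estimator as an illustration, not necessarily the paper's exact procedure.

```python
import math

def snis_expected_reward(logp_base, logp_small, rewards, beta):
    """Self-normalized importance-sampling estimate of the expected reward
    under the tilted base policy pi_{beta,B}, from candidates drawn via pi_S.

    logp_base[i]  : log pi_B(y_i | x) for candidate y_i (hypothetical inputs)
    logp_small[i] : log pi_S(y_i | x)
    rewards[i]    : r(x, y_i)
    Each candidate's weight is proportional to (pi_B/pi_S)(y_i) * exp(beta * r_i).
    """
    log_w = [lb - ls + beta * r
             for lb, ls, r in zip(logp_base, logp_small, rewards)]
    m = max(log_w)  # stabilize before exponentiating
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    return sum(wi * ri for wi, ri in zip(w, rewards)) / z
```

When \pi_S equals \pi_B and \beta = 0, the weights are uniform and the estimate reduces to the plain average reward over the candidates, as expected.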