Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference

arXiv cs.LG / 5/6/2026


Key Points

  • The paper shows that the benefit of repeated LLM inference (repeated sampling at test time) depends on the latent distribution of correctness across examples, not just on one-call accuracy.
  • Using one labeled inference call per example, the authors estimate the first moment of the latent success probability; two labeled calls identify its second moment, which pins down the same-example correctness correlation and distinguishes stable errors from recoverable call-level randomness.
  • They derive distribution-free, exact two-call bounds for any fixed majority-vote compute budget, using a moment-problem reduction to three-atom extremizers with quadratic dual certificates.
  • For the first practical majority-vote budget (three votes), they provide a closed-form interval with width at most 1/8 and a certified-improvement criterion, while also analyzing the infinite-vote limit and its strong sensitivity to latent mass near q=1/2.
  • Experiments on LLM inference over QNLI and QQP, including maximum-entropy and latent-difficulty Gaussian-probit (LDGP) point completions, indicate that observed three- and five-vote accuracies fall within the predicted two-call regions, and that temperature changes or randomized model mixtures can yield voting gains not implied by one-call accuracy ordering.

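The two-moment identification above can be sketched with a small simulation. Assuming a hypothetical bimodal latent distribution of per-example success probabilities q (stable errors near 0, recoverable randomness near 0.7 — the specific values are illustrative, not from the paper), one labeled call per example estimates E[q], the fraction of examples where two independent calls are both correct estimates E[q²], and the same-example correctness correlation follows:

```python
import random

random.seed(0)

# Hypothetical latent success probabilities across 1000 examples:
# 30% stable errors (q = 0.05), 70% recoverable randomness (q = 0.7).
qs = [0.05] * 300 + [0.7] * 700

# One labeled call per example estimates the first moment E[q],
# i.e. ordinary one-call accuracy.
m1 = sum(1 for q in qs if random.random() < q) / len(qs)

# Two labeled calls per example: the fraction where BOTH calls are
# correct estimates the second moment E[q^2].
m2 = sum(1 for q in qs
         if random.random() < q and random.random() < q) / len(qs)

# Same-example correctness correlation: Var(q) / (m1 * (1 - m1)).
# Near 1 => stable errors dominate; near 0 => pure call-level noise.
rho = (m2 - m1 ** 2) / (m1 * (1 - m1))
print(f"m1 ~ {m1:.3f}, m2 ~ {m2:.3f}, rho ~ {rho:.3f}")
```

For this latent mixture the true values are E[q] = 0.505 and E[q²] ≈ 0.344, so the estimated correlation is strictly between 0 and 1, reflecting a blend of stable and recoverable failures.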
Abstract

Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeated LLM inference under conditional-i.i.d. calls. One labeled call identifies the mean latent success probability; two labeled calls identify its second moment and hence the same-example correctness correlation that separates stable errors from recoverable call-level randomness. From these two moments, every fixed majority-vote budget has a sharp distribution-free two-call interval. The key technical reduction is that the infinite-dimensional moment problem has three-atom extremizers and quadratic dual certificates for every finite budget, so the bounds are exact rather than discretized or parametric. The first useful budget, three votes, has a closed form, width at most 1/8, and a certified-improvement criterion. The infinite-vote endpoint is the limit of majority voting as the number of calls tends to infinity; it is also sharply bounded, but remains threshold-sensitive because it depends on latent mass around q=1/2. We add maximum-entropy and Latent-difficulty Gaussian-probit (LDGP) point completions, and experiments on LLM calls over QNLI and QQP show that empirical three- and five-vote accuracies are contained in the projected two-call regions while temperature changes and randomized model mixtures can create voting gains not ordered by one-call accuracy.
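The moment-problem reduction can be illustrated numerically. Conditional on q, three i.i.d. calls give majority accuracy 3q² - 2q³, so the fixed-budget accuracy is E[3q² - 2q³], which two moments do not determine exactly but do confine to a sharp interval attained by three-atom latent distributions. The sketch below (the helper name, grid resolution, and example moments are illustrative assumptions; the paper's bounds are exact closed forms, not a grid search) scans three-atom candidates whose weights match E[q] and E[q²]:

```python
import itertools
import numpy as np

def three_vote_bounds(m1, m2, grid=41):
    """Bound E[3q^2 - 2q^3] (three-vote majority accuracy) given only
    E[q] = m1 and E[q^2] = m2 by scanning three-atom latent
    distributions on a grid -- a numerical illustration of the
    three-atom extremizer reduction, not the paper's closed form."""
    atoms = np.linspace(0.0, 1.0, grid)
    f = lambda q: 3 * q**2 - 2 * q**3   # P(majority of 3 correct | q)
    lo, hi = np.inf, -np.inf
    for a, b, c in itertools.combinations(atoms, 3):
        # Solve for atom weights matching mass, E[q], and E[q^2].
        A = np.array([[1.0, 1.0, 1.0], [a, b, c], [a*a, b*b, c*c]])
        w = np.linalg.solve(A, [1.0, m1, m2])
        if np.all(w >= -1e-9):          # keep valid probability weights
            val = w[0] * f(a) + w[1] * f(b) + w[2] * f(c)
            lo, hi = min(lo, val), max(hi, val)
    return lo, hi

# Hypothetical identified moments, e.g. one-call accuracy 0.505:
lo, hi = three_vote_bounds(0.505, 0.344)
print(f"three-vote accuracy lies in [{lo:.3f}, {hi:.3f}]")
```

Because the grid restricts the feasible set, the computed interval sits inside the exact two-call bounds, so its width never exceeds the 1/8 guarantee for the three-vote budget.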