Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference

arXiv cs.LG / 5/6/2026


Key Points

  • The paper shows that the benefit of repeated LLM inference (repeated sampling at test time) depends on the latent distribution of correctness across examples, not just on one-call accuracy.
  • Using one labeled inference call per example, the authors estimate the first moment of the latent success probability; two labeled calls identify its second moment, which pins down the same-example correctness correlation and distinguishes stable errors from recoverable call-level randomness.
  • They derive distribution-free, exact two-call bounds for any fixed majority-vote compute budget, using a moment-problem reduction to three-atom extremizers with quadratic dual certificates.
  • For the first practical majority-vote budget (three votes), they provide a closed-form interval with width at most 1/8 and a certified-improvement criterion, while also analyzing the infinite-vote limit and its strong sensitivity to latent mass near q=1/2.
  • Experiments on LLM inference over QNLI and QQP, including maximum-entropy and latent-difficulty Gaussian-probit (LDGP) point completions, indicate that observed three- and five-vote accuracies fall within the predicted two-call regions, and that temperature changes or randomized model mixtures can yield voting gains not implied by one-call accuracy ordering.

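The two-moment identification above can be sketched with a small simulation. Assuming a hypothetical bimodal latent distribution of per-example success probabilities q (stable errors near 0, recoverable randomness near 0.7 — the specific values are illustrative, not from the paper), one labeled call per example estimates E[q], the fraction of examples where two independent calls are both correct estimates E[q²], and the same-example correctness correlation follows:

```python
import random

random.seed(0)

# Hypothetical latent success probabilities across 1000 examples:
# 30% stable errors (q = 0.05), 70% recoverable randomness (q = 0.7).
qs = [0.05] * 300 + [0.7] * 700

# One labeled call per example estimates the first moment E[q],
# i.e. ordinary one-call accuracy.
m1 = sum(1 for q in qs if random.random() < q) / len(qs)

# Two labeled calls per example: the fraction where BOTH calls are
# correct estimates the second moment E[q^2].
m2 = sum(1 for q in qs
         if random.random() < q and random.random() < q) / len(qs)

# Same-example correctness correlation: Var(q) / (m1 * (1 - m1)).
# Near 1 => stable errors dominate; near 0 => pure call-level noise.
rho = (m2 - m1 ** 2) / (m1 * (1 - m1))
print(f"m1 ~ {m1:.3f}, m2 ~ {m2:.3f}, rho ~ {rho:.3f}")
```

For this latent mixture the true values are E[q] = 0.505 and E[q²] ≈ 0.344, so the estimated correlation is strictly between 0 and 1, reflecting a blend of stable and recoverable failures.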
Abstract

Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeated LLM inference under conditional-i.i.d. calls. One labeled call identifies the mean latent success probability; two labeled calls identify its second moment and hence the same-example correctness correlation that separates stable errors from recoverable call-level randomness. From these two moments, every fixed majority-vote budget has a sharp distribution-free two-call interval. The key technical reduction is that the infinite-dimensional moment problem has three-atom extremizers and quadratic dual certificates for every finite budget, so the bounds are exact rather than discretized or parametric. The first useful budget, three votes, has a closed form, width at most 1/8, and a certified-improvement criterion. The infinite-vote endpoint is the limit of majority voting as the number of calls tends to infinity; it is also sharply bounded, but remains threshold-sensitive because it depends on latent mass around q=1/2. We add maximum-entropy and Latent-difficulty Gaussian-probit (LDGP) point completions, and experiments on LLM calls over QNLI and QQP show that empirical three- and five-vote accuracies are contained in the projected two-call regions while temperature changes and randomized model mixtures can create voting gains not ordered by one-call accuracy.
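The moment-problem reduction can be illustrated numerically. Conditional on q, three i.i.d. calls give majority accuracy 3q² - 2q³, so the fixed-budget accuracy is E[3q² - 2q³], which two moments do not determine exactly but do confine to a sharp interval attained by three-atom latent distributions. The sketch below (the helper name, grid resolution, and example moments are illustrative assumptions; the paper's bounds are exact closed forms, not a grid search) scans three-atom candidates whose weights match E[q] and E[q²]:

```python
import itertools
import numpy as np

def three_vote_bounds(m1, m2, grid=41):
    """Bound E[3q^2 - 2q^3] (three-vote majority accuracy) given only
    E[q] = m1 and E[q^2] = m2 by scanning three-atom latent
    distributions on a grid -- a numerical illustration of the
    three-atom extremizer reduction, not the paper's closed form."""
    atoms = np.linspace(0.0, 1.0, grid)
    f = lambda q: 3 * q**2 - 2 * q**3   # P(majority of 3 correct | q)
    lo, hi = np.inf, -np.inf
    for a, b, c in itertools.combinations(atoms, 3):
        # Solve for atom weights matching mass, E[q], and E[q^2].
        A = np.array([[1.0, 1.0, 1.0], [a, b, c], [a*a, b*b, c*c]])
        w = np.linalg.solve(A, [1.0, m1, m2])
        if np.all(w >= -1e-9):          # keep valid probability weights
            val = w[0] * f(a) + w[1] * f(b) + w[2] * f(c)
            lo, hi = min(lo, val), max(hi, val)
    return lo, hi

# Hypothetical identified moments, e.g. one-call accuracy 0.505:
lo, hi = three_vote_bounds(0.505, 0.344)
print(f"three-vote accuracy lies in [{lo:.3f}, {hi:.3f}]")
```

Because the grid restricts the feasible set, the computed interval sits inside the exact two-call bounds, so its width never exceeds the 1/8 guarantee for the three-vote budget.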