When LLM Judge Scores Look Good but Best-of-N Decisions Fail
arXiv cs.AI / 3/16/2026
Key Points
- Large language models are widely used as judges to score candidate responses, but global agreement metrics alone can be misleading when the goal is best-of-n selection among candidates for the same prompt.
- In a 5,000-prompt best-of-4 benchmark, a judge with moderate global correlation (r = 0.47) captures only about 21% of the potential improvement from perfect selection over random choice.
- The shortfall arises because global agreement is largely driven by prompt-level baseline effects, whereas effective selection depends on ranking candidates within a prompt, where the judge's signal is much weaker (within-prompt correlation r_within ≈ 0.27) and roughly 67% of pairwise comparisons end in ties.
- Using explicit pairwise judging in matched-pair best-of-2 audits recovers much of the lost signal, raising recovery from ~21% to ~61%; the findings suggest that judge audits should report within-prompt signal, tie rates, and top-1 recovery rather than global agreement alone (a toy simulation of these quantities is sketched after this list).
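
To make the gap between global and within-prompt signal concrete, here is a minimal sketch under an assumed additive noise model: the judge tracks each prompt's baseline quality well but ranks candidates within a prompt only weakly. All variable names and noise levels are illustrative assumptions, not the paper's data or method, and the printed values will not reproduce the figures above.

```python
# Toy simulation (hypothetical numbers): why a decent global judge-human
# correlation can coexist with poor best-of-n selection.
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_candidates = 5000, 4

# True quality = prompt-level baseline + within-prompt candidate effect.
baseline = rng.normal(0.0, 1.0, size=(n_prompts, 1))
within = rng.normal(0.0, 1.0, size=(n_prompts, n_candidates))
true_q = baseline + within

# Judge score: reproduces the baseline, but the within-prompt part is noisy.
judge = baseline + 0.3 * within + rng.normal(0.0, 1.0, size=(n_prompts, n_candidates))

# Global correlation pools all (prompt, candidate) pairs; shared baselines inflate it.
r_global = np.corrcoef(true_q.ravel(), judge.ravel())[0, 1]

# Within-prompt correlation: center both scores per prompt, then correlate.
tc = true_q - true_q.mean(axis=1, keepdims=True)
jc = judge - judge.mean(axis=1, keepdims=True)
r_within = np.corrcoef(tc.ravel(), jc.ravel())[0, 1]

# Best-of-n recovery: fraction of the oracle-over-random gain captured by
# picking the judge's top-scored candidate in each prompt.
picked = true_q[np.arange(n_prompts), judge.argmax(axis=1)].mean()
random_choice = true_q.mean()
oracle = true_q.max(axis=1).mean()
recovery = (picked - random_choice) / (oracle - random_choice)

# Tie rate: if judge scores are reported on a coarse rubric, many within-prompt
# pairwise comparisons end in ties and carry no selection signal.
coarse = np.round(judge * 2) / 2  # half-point rubric (illustrative)
i, j = np.triu_indices(n_candidates, k=1)
tie_rate = (coarse[:, i] == coarse[:, j]).mean()

print(f"global r          = {r_global:.2f}")
print(f"within-prompt r   = {r_within:.2f}")
print(f"top-1 recovery    = {recovery:.1%}")
print(f"pairwise tie rate = {tie_rate:.1%}")
```

The design point the sketch illustrates is that pooling scores across prompts lets the shared baseline inflate the correlation; an audit aimed at best-of-n selection should instead center scores per prompt (as r_within does here) and report the recovery fraction and tie rate directly.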