When LLM Judge Scores Look Good but Best-of-N Decisions Fail
arXiv cs.AI / 3/16/2026
Key Points
- Large language models are widely used as judges to score candidate responses, but global agreement metrics alone can be misleading for best-of-n selection, where the judge must rank candidates within a single prompt.
- In a 5,000-prompt best-of-4 benchmark, a judge with moderate global correlation (r = 0.47) captures only about 21% of the potential improvement from perfect selection over random choice.
- The shortfall arises because global agreement is driven by prompt-level baseline effects, while effective selection depends on within-prompt ranking (within-prompt correlation r_within ≈ 0.27) and a high rate of ties in pairwise comparisons (≈67%).
- Using explicit pairwise judging in matched-pair best-of-2 audits recovers much of the lost signal, raising recovery from ~21% to ~61%; the authors therefore suggest that audits report within-prompt signal, tie rates, and top-1 recovery rather than global agreement alone.
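The distinction between global and within-prompt signal can be sketched in a small audit function. This is a minimal illustration of the metrics the key points describe, not the paper's actual evaluation code; the function name and the synthetic-data setup are assumptions. Global correlation pools all (prompt, candidate) pairs, so it absorbs prompt-level baseline differences; centering each prompt's scores first isolates the ranking signal that selection actually uses, and top-1 recovery measures the fraction of the oracle-over-random gain the judge captures.

```python
import numpy as np

def audit_judge(true_quality, judge_scores):
    """Audit a judge for best-of-n selection (illustrative sketch).

    true_quality, judge_scores: arrays of shape (n_prompts, n_candidates).
    Returns (global Pearson r, mean within-prompt r, top-1 recovery).
    """
    tq = np.asarray(true_quality, dtype=float)
    js = np.asarray(judge_scores, dtype=float)

    # Global correlation over all (prompt, candidate) pairs: inflated
    # by between-prompt baseline differences.
    r_global = np.corrcoef(tq.ravel(), js.ravel())[0, 1]

    # Within-prompt correlation: center per prompt to remove baselines,
    # keeping only the ranking signal that drives selection.
    tq_c = tq - tq.mean(axis=1, keepdims=True)
    js_c = js - js.mean(axis=1, keepdims=True)
    r_within = np.corrcoef(tq_c.ravel(), js_c.ravel())[0, 1]

    # Top-1 recovery: judge's gain over a random pick, as a fraction
    # of the oracle (perfect-selection) gain over a random pick.
    picked = tq[np.arange(len(tq)), js.argmax(axis=1)].mean()
    random_choice = tq.mean()
    oracle = tq.max(axis=1).mean()
    recovery = (picked - random_choice) / (oracle - random_choice)
    return r_global, r_within, recovery
```

A judge whose scores track only prompt difficulty (a constant per prompt, plus noise) will show a respectable global r yet near-zero within-prompt r and near-zero recovery, which is the failure mode the benchmark surfaces.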