When LLM Judge Scores Look Good but Best-of-N Decisions Fail
arXiv cs.AI / 3/16/2026
Key Points
- Large language models are widely used as judges to score candidate responses, but global agreement metrics alone can be misleading for best-of-n selection, where the judge must rank candidates within a single prompt.
- In a 5,000-prompt best-of-4 benchmark, a judge with moderate global correlation (r = 0.47) captures only about 21% of the potential improvement from perfect selection over random choice.
- The shortfall arises because global agreement is driven by prompt-level baseline effects, while effective selection depends on within-prompt ranking (within-prompt correlation r_within ≈ 0.27) and a high rate of ties in pairwise comparisons (≈67%).
- Using explicit pairwise judging in matched-pair best-of-2 audits recovers much of the lost signal, raising recovery from ~21% to ~61%; the authors therefore suggest that audits report within-prompt signal, tie rates, and top-1 recovery rather than global agreement alone.
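The distinction between global and within-prompt signal can be sketched in a small audit function. This is a minimal illustration of the metrics the key points describe, not the paper's actual evaluation code; the function name and the synthetic-data setup are assumptions. Global correlation pools all (prompt, candidate) pairs, so it absorbs prompt-level baseline differences; centering each prompt's scores first isolates the ranking signal that selection actually uses, and top-1 recovery measures the fraction of the oracle-over-random gain the judge captures.

```python
import numpy as np

def audit_judge(true_quality, judge_scores):
    """Audit a judge for best-of-n selection (illustrative sketch).

    true_quality, judge_scores: arrays of shape (n_prompts, n_candidates).
    Returns (global Pearson r, mean within-prompt r, top-1 recovery).
    """
    tq = np.asarray(true_quality, dtype=float)
    js = np.asarray(judge_scores, dtype=float)

    # Global correlation over all (prompt, candidate) pairs: inflated
    # by between-prompt baseline differences.
    r_global = np.corrcoef(tq.ravel(), js.ravel())[0, 1]

    # Within-prompt correlation: center per prompt to remove baselines,
    # keeping only the ranking signal that drives selection.
    tq_c = tq - tq.mean(axis=1, keepdims=True)
    js_c = js - js.mean(axis=1, keepdims=True)
    r_within = np.corrcoef(tq_c.ravel(), js_c.ravel())[0, 1]

    # Top-1 recovery: judge's gain over a random pick, as a fraction
    # of the oracle (perfect-selection) gain over a random pick.
    picked = tq[np.arange(len(tq)), js.argmax(axis=1)].mean()
    random_choice = tq.mean()
    oracle = tq.max(axis=1).mean()
    recovery = (picked - random_choice) / (oracle - random_choice)
    return r_global, r_within, recovery
```

A judge whose scores track only prompt difficulty (a constant per prompt, plus noise) will show a respectable global r yet near-zero within-prompt r and near-zero recovery, which is the failure mode the benchmark surfaces.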