Mediocrity is the Key for LLM-as-a-Judge Anchor Selection
arXiv cs.CL / 3/18/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper systematically evaluates 22 anchors on the Arena-Hard-v2.0 dataset and finds that the choice of anchor critically affects how well model rankings agree with human judgments.
- It shows that commonly chosen anchors, such as the best- or worst-performing models, make poor reference points: their extreme performance fails to discriminate among the relative ordering of most models.
- The study finds that the effect of anchor selection is comparable in size to the effect of judge-model selection, underscoring its importance in benchmark design.
- A power analysis shows that standard benchmark sizes are too small to reliably distinguish between closely matched models in pairwise evaluation (see the sketch after this list).
- The authors provide actionable recommendations, including guidelines for selecting informative anchors and for sizing benchmarks so that evaluation is both reliable and efficient.
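To make the power-analysis point concrete, the sketch below estimates how many prompts a pairwise evaluation needs before a small gap in win rate against a common anchor becomes statistically detectable. This is a minimal illustration using a standard two-proportion z-test approximation, not the paper's exact procedure; the `required_samples` function name and the 52% vs. 48% example figures are assumptions for illustration.

```python
import math
from statistics import NormalDist

def required_samples(p1: float, p2: float,
                     alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-model sample size for a two-proportion z-test to detect the gap
    between two models' win rates (e.g., against a shared anchor)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)                    # two-sided significance threshold
    z_beta = z(power)                             # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)      # combined binomial variance
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

# Two competitive models: 52% vs. 48% win rate against the same anchor.
print(required_samples(0.52, 0.48))
```

With these illustrative numbers the approximation calls for roughly 2,400 judged prompts per model, well above the few hundred prompts typical of current benchmarks, which is consistent with the paper's claim that standard benchmark sizes cannot reliably separate competitive models.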