Mediocrity is the key for LLM as a Judge Anchor Selection
arXiv cs.CL / 3/18/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper systematically evaluates 22 anchors on the Arena-Hard-v2.0 dataset and finds anchor choice critically affects the reliability of model rankings compared to human judgments.
- It shows that commonly used anchors, such as the best- or worst-performing models, are poor choices: because they sit at the extremes, comparisons against them fail to reflect the relative ordering of most models.
- The study finds that the effect size of anchor selection is comparable to the effect of selecting the judge model, underscoring its importance in benchmark design.
- A power analysis demonstrates that standard benchmark sizes are too small for reliable pairwise evaluation: at typical scales, the evaluation cannot reliably distinguish between closely matched models.
- The authors provide actionable recommendations, including guidelines for selecting informative anchors and ensuring benchmark sizes are sufficient for reliable and efficient evaluation.
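The benchmark-size point can be made concrete with a standard two-proportion power calculation. This is a sketch under assumptions, not the paper's exact procedure; the function name and the 0.50-vs-0.55 win rates are illustrative:

```python
import math
from statistics import NormalDist

def required_sample_size(p1: float, p2: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Prompts needed to distinguish two models whose win rates against a
    shared anchor are p1 and p2 (two-sided two-proportion z-test,
    normal approximation, pooled variance under the null)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # critical value for a two-sided test
    z_beta = nd.inv_cdf(power)            # quantile for the desired power
    p_bar = (p1 + p2) / 2                 # pooled proportion under H0
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Two competitive models separated by 5 points of anchor win rate
# need well over a thousand prompts at alpha=0.05, power=0.8.
print(required_sample_size(0.50, 0.55))
```

With α = 0.05 and 80% power, a 0.50-vs-0.55 split calls for about 1,565 prompts, which illustrates why benchmarks of a few hundred prompts cannot reliably separate closely matched systems.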