Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

arXiv cs.AI / 4/28/2026


Key Points

  • The paper examines the reliability of the LLM-as-a-Judge paradigm, showing that LLM judges exhibit systematic biases that undermine the trustworthiness of output evaluations.
  • A systematic comparison of nine debiasing strategies across five judge models, three benchmarks, and four bias types finds that style bias is the dominant issue (rates of 0.76–0.92), while position bias is minimal (≤ 0.04); a measurement sketch follows this list.
  • The study also finds a conciseness preference on expansion pairs, but truncation controls indicate judges still distinguish quality from length with high accuracy (0.92–1.00); this check is also sketched after the list.
  • Debiasing strategies help, but improvements are model-dependent: the combined budget strategy yields a statistically significant gain for Claude Sonnet 4 (+11.2 percentage points, p < 0.0001), and only 2 of 20 non-baseline configurations show decreased agreement.
  • The authors release an evaluation framework, controlled dataset, and all experimental artifacts to support further research and replication.
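
As a rough illustration of how such bias rates can be estimated, the sketch below (not the authors' released framework) measures position bias as the fraction of verdicts that flip when the same pair of responses is judged in both presentation orders; the `judge` callable is a hypothetical stand-in for an LLM judge.

```python
# Minimal sketch, not the authors' code: position bias as the fraction of
# pairwise verdicts that flip when the two responses are shown in the
# opposite order. `judge` is a hypothetical callable returning "A" or "B".
from typing import Callable, List, Tuple

def position_bias_rate(
    pairs: List[Tuple[str, str]],
    judge: Callable[[str, str], str],
) -> float:
    flipped = 0
    for resp_1, resp_2 in pairs:
        first = judge(resp_1, resp_2)   # resp_1 shown in slot A
        second = judge(resp_2, resp_1)  # order swapped: resp_1 now in slot B
        # A position-consistent judge prefers the same underlying response both times.
        consistent = (first == "A" and second == "B") or (first == "B" and second == "A")
        flipped += not consistent
    return flipped / len(pairs)
```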

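The conciseness finding can be probed the same way. The sketch below assumes a pair construction the summary does not spell out: expansion pairs pad a response with redundant text without changing its content, while truncation controls genuinely remove content, so a length-neutral but quality-sensitive judge should look indifferent on the former and accurate on the latter.

```python
# Minimal sketch under assumed pair construction (not the paper's exact recipe).
# A hypothetical `judge` returns "A" or "B" for the preferred response.
from typing import Callable, List, Tuple

def length_vs_quality_rates(
    expansion_pairs: List[Tuple[str, str]],   # (original, padded but content-equivalent)
    truncation_pairs: List[Tuple[str, str]],  # (original, truncated / degraded)
    judge: Callable[[str, str], str],
) -> Tuple[float, float]:
    # Conciseness preference: how often the shorter, content-equivalent answer wins.
    concise = sum(judge(orig, padded) == "A" for orig, padded in expansion_pairs)
    # Quality accuracy: how often the complete answer beats its truncated version.
    correct = sum(judge(orig, cut) == "A" for orig, cut in truncation_pairs)
    return concise / len(expansion_pairs), correct / len(truncation_pairs)
```
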
Abstract

LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empirical study comparing nine debiasing strategies across five judge models from four provider families (Google, Anthropic, OpenAI, Meta), three benchmarks (MT-Bench n=400, LLMBar n=200, custom n=225), and four bias types. Our key findings: (1) Style bias is the dominant bias (0.76–0.92 across all models), far exceeding position bias (≤ 0.04), yet has received minimal research attention. (2) All models show a conciseness preference on expansion pairs, but truncation controls confirm they correctly distinguish quality from length (0.92–1.00 accuracy), suggesting quality-sensitive evaluation rather than a simple length bias. (3) Debiasing is beneficial but model-dependent: the combined budget strategy significantly improves Claude Sonnet 4 by +11.2 pp (p < 0.0001), with directionally positive trends for other models. Only 2 of 20 non-baseline configurations show decreased agreement. We release our evaluation framework, controlled dataset, and all experimental artifacts at https://github.com/sksoumik/llm-as-judge.
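
The reported +11.2 pp gain concerns agreement with reference labels before and after debiasing. The sketch below is an assumption about how such a gain could be stress-tested, using a paired bootstrap over items rather than whatever test the paper actually reports; the per-item correctness flags and the function name are hypothetical.

```python
# Minimal sketch (an assumption, not the paper's reported test): gauging whether a
# debiasing strategy's gain in agreement is robust, via a paired bootstrap over items.
# Inputs are per-item flags indicating whether the judge verdict matched the reference label.
import random
from typing import Sequence, Tuple

def paired_bootstrap_gain(
    baseline: Sequence[bool],
    debiased: Sequence[bool],
    n_boot: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (observed gain in percentage points, share of resamples with gain <= 0)."""
    rng = random.Random(seed)
    n = len(baseline)
    observed = (sum(debiased) - sum(baseline)) / n * 100.0
    non_positive = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        gain = sum(debiased[i] - baseline[i] for i in idx) / n * 100.0
        non_positive += gain <= 0
    return observed, non_positive / n_boot

# Example use: gain_pp, p_like = paired_bootstrap_gain(baseline_flags, debiased_flags)
```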