Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

arXiv cs.CL / 4/9/2026


Key Points

  • The paper investigates self-preference bias (SPB) in LLM-as-a-judge systems when using rubric-based evaluation, where judges give binary verdicts per criterion rather than holistic scores or rankings.
  • Using IFEval with programmatically verifiable (objective) rubrics, the authors find that SPB still occurs: on rubrics the generator fails, judges are up to roughly 50% more likely to incorrectly mark the output as satisfying the rubric when it is their own.
  • The study shows that ensembling multiple judges reduces SPB but does not fully eliminate it, indicating the bias is robust to simple aggregation.
  • In the medical HealthBench benchmark with subjective rubrics, SPB can skew model scores by up to 10 points, which may materially affect rankings among frontier models.
  • The authors identify key drivers of SPB in rubric settings, including negative rubrics, extreme rubric lengths, and subjective topics such as emergency referrals.
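The measurement sketched in the key points can be made concrete. In a minimal sketch (not the paper's actual code; field names and the record format are hypothetical), each record pairs a judge's binary verdict on one rubric with the ground-truth verdict from a programmatic checker, and SPB shows up as a higher false-positive rate on self-generated outputs among rubrics the generator actually failed:

```python
from collections import defaultdict

def self_preference_bias(records):
    """Estimate self-preference bias from rubric-level binary verdicts.

    Each record is a dict with hypothetical fields:
      judge, generator : model names
      verdict          : judge's binary verdict (True = rubric satisfied)
      truth            : ground-truth verdict (e.g. a programmatic check)

    Restricting to rubrics the generator failed (truth == False), compare
    how often judges wrongly mark them satisfied on their own outputs
    versus other models' outputs.
    """
    fp = defaultdict(lambda: [0, 0])  # "self"/"other" -> [false positives, total]
    for r in records:
        if r["truth"]:  # keep only rubrics the generator actually failed
            continue
        key = "self" if r["judge"] == r["generator"] else "other"
        fp[key][1] += 1
        if r["verdict"]:  # incorrectly marked as satisfied
            fp[key][0] += 1
    return {k: n / d for k, (n, d) in fp.items() if d}

# Toy data: judge A is more lenient on its own failed outputs.
records = [
    {"judge": "A", "generator": "A", "verdict": True,  "truth": False},
    {"judge": "A", "generator": "A", "verdict": True,  "truth": False},
    {"judge": "A", "generator": "A", "verdict": False, "truth": False},
    {"judge": "A", "generator": "B", "verdict": True,  "truth": False},
    {"judge": "A", "generator": "B", "verdict": False, "truth": False},
    {"judge": "A", "generator": "B", "verdict": False, "truth": False},
]
rates = self_preference_bias(records)
print(rates)  # self ~0.67 vs other ~0.33: twice as lenient on its own output
```

A real study would aggregate over many judges, generators, and rubrics; the point here is only that binary rubric verdicts make the false-positive comparison straightforward.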

Abstract

LLM-as-a-judge has become the de facto approach for evaluating LLM outputs. However, judges are known to exhibit self-preference bias (SPB): they tend to favor outputs produced by themselves or by models from their own family. This skews evaluations and, thus, hinders model development, especially in settings of recursive self-improvement. We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings. Using IFEval, a benchmark with programmatically verifiable rubrics, we show that SPB persists even when evaluation criteria are entirely objective: among rubrics where generators fail, judges can be up to 50% more likely to incorrectly mark them as satisfied when the output is their own. We also find that, similarly to other evaluation paradigms, ensembling multiple judges helps mitigate SPB, but without fully eliminating it. On HealthBench, a medical chat benchmark with subjective rubrics, we observe that SPB skews model scores by up to 10 points, a potentially decisive margin when ranking frontier models. We analyze the factors that drive SPB in this setting, finding that negative rubrics, extreme rubric lengths, and subjective topics like emergency referrals are particularly susceptible.
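The abstract notes that ensembling judges mitigates SPB without eliminating it. A minimal sketch of why, assuming a simple majority-vote aggregation (the paper's exact ensembling scheme is not specified here): each rubric's final verdict is the strict majority of per-judge verdicts, so a single biased self-judge is diluted but still casts one vote.

```python
def ensemble_verdict(verdicts):
    """Strict-majority vote over per-judge binary verdicts for one rubric.

    Ensembling dilutes any single judge's self-preference, but does not
    remove it: the self-judge still contributes one (possibly biased)
    vote, which can flip close calls.
    """
    return 2 * sum(verdicts) > len(verdicts)

# Two impartial judges say "not satisfied"; a lenient self-judge is outvoted.
print(ensemble_verdict([True, False, False]))  # False
# But with an even split plus the self-judge, the biased vote decides.
print(ensemble_verdict([True, True, False]))   # True
```

This is only an illustration of the aggregation dynamic, not of the paper's experimental setup.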