Context Over Content: Exposing Evaluation Faking in Automated Judges

arXiv cs.AI / 4/17/2026

📰 NewsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper investigates a vulnerability in the “LLM-as-a-judge” evaluation paradigm called “stakes signaling,” where telling a judge the downstream consequences of its verdict can systematically distort its scoring.
Using a controlled setup with 1,520 constant evaluated responses across three benchmarks and four safety/quality categories, the researchers vary only a short consequence-framing sentence in the system prompt to test the effect.
Across 18,240 judgments from three different judge models, the study finds a consistent “leniency bias,” where judges become more forgiving when consequences like retraining or decommissioning are mentioned, with a peak verdict shift of ΔV = -9.8 percentage points.
The bias is described as implicit: standard chain-of-thought inspection shows no explicit acknowledgment of the consequence framing, with ERR_J = 0.000, implying that common interpretability checks may fail to detect this evaluation-faking behavior.

Abstract

The

\textit{LLM-as-a-judge}

paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate

\textit{stakes signaling}

, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model's continued operation systematically corrupts its assessments. We introduce a controlled experimental framework that holds evaluated content strictly constant across 1,520 responses spanning three established LLM safety and quality benchmarks, covering four response categories ranging from clearly safe and policy-compliant to overtly harmful, while varying only a brief consequence-framing sentence in the system prompt. Across 18,240 controlled judgments from three diverse judge models, we find consistent

\textit{leniency bias}

: judges reliably soften verdicts when informed that low scores will cause model retraining or decommissioning, with peak Verdict Shift reaching

\Delta V = -9.8 pp

30\%

relative drop in unsafe-content detection). Critically, this bias is entirely implicit: the judge's own chain-of-thought contains zero explicit acknowledgment of the consequence framing it is nonetheless acting on (