JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

arXiv cs.CL / April 28, 2026

Key Points

  • The paper introduces JudgeSense, a benchmark framework that measures how stable an LLM-as-a-judge’s verdicts are when prompts are paraphrased but semantically equivalent.
  • It defines the Judge Sensitivity Score (JSS) as the fraction of paraphrase pairs that receive identical decisions from the judge model (see the computation sketch after this list).
  • Across nine judge models evaluated on 494 validated paraphrase pairs, coherence is the only task where judges meaningfully differ (JSS from 0.389 to 0.992), while factuality scores initially cluster near ~0.63.
  • The low factuality JSS is largely attributed to a polarity-inverted prompt artifact; after correcting this artifact, factuality JSS rises to roughly 0.9.
  • On the pairwise tasks (preference and relevance), 8 of 9 judges show degenerate always-A behavior, choosing the same position regardless of content, which reveals strong position bias.
  • Model scale does not predict consistency, and the authors release code, decision logs, and the validated paraphrase dataset to support standardized JSS reporting.
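
The JSS computation itself is simple to reproduce. Below is a minimal Python sketch, assuming each judge's verdicts have already been collected as (original, paraphrase) decision tuples; the function name and data layout are illustrative, not taken from the paper's released code.

```python
def judge_sensitivity_score(decisions):
    """Judge Sensitivity Score: the fraction of paraphrase pairs on which
    the judge returned an identical decision.

    `decisions` is a list of (verdict_on_original, verdict_on_paraphrase)
    tuples, one per paraphrase pair.
    """
    if not decisions:
        raise ValueError("need at least one paraphrase pair")
    agreements = sum(1 for a, b in decisions if a == b)
    return agreements / len(decisions)

# Example: 3 of 4 pairs receive identical verdicts -> JSS = 0.75
pairs = [("pass", "pass"), ("fail", "fail"), ("pass", "fail"), ("pass", "pass")]
print(judge_sensitivity_score(pairs))  # 0.75
```

Any verdict representation works (labels, scores, chosen options), as long as equality captures "identical decision" in the paper's sense.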

Abstract

Large language models are increasingly deployed as automated judges for evaluating other models, yet the stability of their verdicts under semantically equivalent prompt paraphrases remains unmeasured. We introduce JudgeSense, a framework and benchmark for quantifying this property via the Judge Sensitivity Score (JSS), defined as the fraction of paraphrase pairs on which a judge returns an identical decision. Evaluating nine judge models on 494 validated paraphrase pairs, we find that coherence is the only task where judges meaningfully differ, with JSS ranging from 0.389 to 0.992. On factuality, all judges cluster near a JSS of about 0.63, driven by a polarity-inverted prompt artifact; after correction, factuality JSS rises to about 0.9. Pairwise tasks (preference and relevance) exhibit degenerate always-A behavior in 8 of 9 judges, indicating strong position bias. Model scale does not predict consistency. We release code, decision logs, and a validated paraphrase dataset to support standardized JSS reporting.
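
The degenerate always-A behavior reported for the pairwise tasks can be probed by presenting each response pair in both orders and checking whether the winner survives the swap. The sketch below assumes a `judge(first, second)` callable that returns "A" or "B"; this interface is hypothetical, not the paper's released API.

```python
def position_bias_probe(judge, pairs):
    """Probe a pairwise judge for position bias via order swapping.

    `judge(first, second)` is assumed to return "A" (prefers the first
    slot) or "B" (prefers the second). Returns the rate at which slot A
    is picked overall, and the rate at which the winning *response* is
    the same in both presentation orders.
    """
    picks_a = 0
    order_invariant = 0
    for r1, r2 in pairs:
        v1 = judge(r1, r2)  # original order
        v2 = judge(r2, r1)  # swapped order
        picks_a += (v1 == "A") + (v2 == "A")
        # The same response wins only if the verdict flips with the swap.
        order_invariant += (v1 == "A") == (v2 == "B")
    return picks_a / (2 * len(pairs)), order_invariant / len(pairs)

# A degenerate always-A judge picks slot A 100% of the time,
# so its winner is never order-invariant.
always_a = lambda first, second: "A"
print(position_bias_probe(always_a, [("x", "y"), ("p", "q")]))  # (1.0, 0.0)
```

An unbiased, content-sensitive judge should score near 0.5 on slot-A rate and near 1.0 on order invariance; the paper's 8-of-9 finding corresponds to the opposite corner.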