JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

arXiv cs.CL / April 28, 2026

Key Points

  • The paper introduces JudgeSense, a benchmark framework that measures how stable an LLM-as-a-judge’s verdicts are when prompts are paraphrased but semantically equivalent.
  • It defines the Judge Sensitivity Score (JSS) as the fraction of paraphrase pairs that receive identical decisions from the judge model (see the computation sketch after this list).
  • Across nine judge models evaluated on 494 validated paraphrase pairs, coherence is the only task where judges meaningfully differ (JSS from 0.389 to 0.992), while factuality scores initially cluster near ~0.63.
  • The low factuality JSS is largely attributed to a polarity-inverted prompt artifact; after correcting this artifact, factuality JSS rises to roughly 0.9.
  • On the pairwise tasks (preference and relevance), 8 of 9 judges show degenerate always-A behavior, choosing the same position regardless of content, which reveals strong position bias.
  • Model scale does not predict consistency, and the authors release code, decision logs, and the validated paraphrase dataset to support standardized JSS reporting.
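
The JSS computation itself is simple to reproduce. Below is a minimal Python sketch, assuming each judge's verdicts have already been collected as (original, paraphrase) decision tuples; the function name and data layout are illustrative, not taken from the paper's released code.

```python
def judge_sensitivity_score(decisions):
    """Judge Sensitivity Score: the fraction of paraphrase pairs on which
    the judge returned an identical decision.

    `decisions` is a list of (verdict_on_original, verdict_on_paraphrase)
    tuples, one per paraphrase pair.
    """
    if not decisions:
        raise ValueError("need at least one paraphrase pair")
    agreements = sum(1 for a, b in decisions if a == b)
    return agreements / len(decisions)

# Example: 3 of 4 pairs receive identical verdicts -> JSS = 0.75
pairs = [("pass", "pass"), ("fail", "fail"), ("pass", "fail"), ("pass", "pass")]
print(judge_sensitivity_score(pairs))  # 0.75
```

Any verdict representation works (labels, scores, chosen options), as long as equality captures "identical decision" in the paper's sense.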

Abstract

Large language models are increasingly deployed as automated judges for evaluating other models, yet the stability of their verdicts under semantically equivalent prompt paraphrases remains unmeasured. We introduce JudgeSense, a framework and benchmark for quantifying this property via the Judge Sensitivity Score (JSS), defined as the fraction of paraphrase pairs on which a judge returns an identical decision. Evaluating nine judge models on 494 validated paraphrase pairs, we find that coherence is the only task where judges meaningfully differ, with JSS ranging from 0.389 to 0.992. On factuality, all judges cluster near a JSS of about 0.63, driven by a polarity-inverted prompt artifact; after correction, factuality JSS rises to about 0.9. Pairwise tasks (preference and relevance) exhibit degenerate always-A behavior in 8 of 9 judges, indicating strong position bias. Model scale does not predict consistency. We release code, decision logs, and a validated paraphrase dataset to support standardized JSS reporting.
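
The degenerate always-A behavior reported for the pairwise tasks can be probed by presenting each response pair in both orders and checking whether the winner survives the swap. The sketch below assumes a `judge(first, second)` callable that returns "A" or "B"; this interface is hypothetical, not the paper's released API.

```python
def position_bias_probe(judge, pairs):
    """Probe a pairwise judge for position bias via order swapping.

    `judge(first, second)` is assumed to return "A" (prefers the first
    slot) or "B" (prefers the second). Returns the rate at which slot A
    is picked overall, and the rate at which the winning *response* is
    the same in both presentation orders.
    """
    picks_a = 0
    order_invariant = 0
    for r1, r2 in pairs:
        v1 = judge(r1, r2)  # original order
        v2 = judge(r2, r1)  # swapped order
        picks_a += (v1 == "A") + (v2 == "A")
        # The same response wins only if the verdict flips with the swap.
        order_invariant += (v1 == "A") == (v2 == "B")
    return picks_a / (2 * len(pairs)), order_invariant / len(pairs)

# A degenerate always-A judge picks slot A 100% of the time,
# so its winner is never order-invariant.
always_a = lambda first, second: "A"
print(position_bias_probe(always_a, [("x", "y"), ("p", "q")]))  # (1.0, 0.0)
```

An unbiased, content-sensitive judge should score near 0.5 on slot-A rate and near 1.0 on order invariance; the paper's 8-of-9 finding corresponds to the opposite corner.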