JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
arXiv cs.CL / 4/28/2026
📰 NewsIdeas & Deep AnalysisModels & Research
Key Points
- The paper introduces JudgeSense, a benchmark framework that measures how stable an LLM-as-a-judge’s verdicts are when prompts are paraphrased but semantically equivalent.
- It defines the Judge Sensitivity Score (JSS) as the fraction of semantically equivalent paraphrase pairs that receive identical verdicts from the judge model (a minimal computation sketch follows this list).
- Across nine judge models evaluated on 494 validated paraphrase pairs, coherence is the dimension where judges differ most (JSS ranging from 0.389 to 0.992), while factuality scores initially cluster around ~0.63.
- The factuality instability is largely attributed to a polarity-inverted prompt artifact; after correcting it, factuality consistency improves to roughly 0.9.
- On pairwise tasks such as preference and relevance, 8 of 9 judges behave degenerately, always selecting the same outcome, which indicates strong position bias; the authors release code, decision logs, and the validated dataset to support standardized reporting.
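
As a rough illustration of how the JSS and the degenerate-behavior check described above could be computed, here is a minimal Python sketch. The record layout, function names, and toy data are hypothetical and not taken from the paper's released code.

```python
# Minimal sketch (assumed schema, not the paper's): JSS is the fraction of
# semantically equivalent paraphrase pairs that receive identical verdicts
# from a judge model, computed per evaluation dimension.
from collections import defaultdict

def judge_sensitivity_score(records):
    """Return per-dimension JSS from (dimension, verdict_a, verdict_b) records."""
    agree = defaultdict(int)
    total = defaultdict(int)
    for dim, verdict_a, verdict_b in records:
        total[dim] += 1
        agree[dim] += int(verdict_a == verdict_b)
    return {dim: agree[dim] / total[dim] for dim in total}

def is_degenerate(pairwise_choices):
    """Flag a judge that always picks the same side in pairwise comparisons."""
    return len(set(pairwise_choices)) == 1

# Toy example: both coherence pairs agree, the factuality pair flips.
records = [
    ("coherence", "pass", "pass"),
    ("coherence", "fail", "fail"),
    ("factuality", "pass", "fail"),
]
print(judge_sensitivity_score(records))     # {'coherence': 1.0, 'factuality': 0.0}
print(is_degenerate(["A", "A", "A", "A"]))  # True -> consistent with position bias
```

A judge that always answers "A" in pairwise comparisons would score a trivially perfect JSS while carrying no evaluative signal, which is why the degenerate-behavior check is reported alongside the score.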