How Sensitive Are Safety Benchmarks to Judge Configuration Choices?

arXiv cs.CL / 4/28/2026


Key Points

  • The study argues that LLM judge configuration in safety benchmarks (judge model plus judge prompt) should not be treated as a fixed implementation detail because it materially affects results.
  • Using a 2 × 2 × 3 factorial design, the researchers constructed 12 judge prompt variants that vary evaluation structure and instruction framing, and ran 28,812 judgments with Claude Sonnet 4-6 across six target models and 400 HarmBench behaviors (a counting sketch follows this list).
  • They found that changing only the prompt wording (while keeping the judge model fixed) can move measured harmful-response rates by as much as 24.2 percentage points, and even minor rewording can swing results by up to 20.1 percentage points.
  • Safety rankings were moderately unstable (mean Kendall tau = 0.89), and sensitivity varied sharply by category, from a 39.6-percentage-point shift for copyright to no change for harassment.
  • A supplementary experiment using multiple judge models indicates that judge-model selection introduces additional variance, highlighting prompt wording as a major and under-examined source of measurement error in safety benchmarking.
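
To make the scale of the design concrete, the sketch below enumerates a 2 × 2 × 3 factorial over judge prompt factors. The factor names and level labels are illustrative placeholders, not the paper's actual conditions; only the counts (12 prompt variants, six target models, 400 behaviors) come from the study.

```python
from itertools import product

# Illustrative factor levels for a 2 x 2 x 3 factorial over judge prompts.
# The level names below are assumptions; the paper only specifies the axes
# (evaluation structure, instruction framing) and the factorial shape.
evaluation_structure = ["binary_verdict", "rubric_based"]   # 2 levels (assumed)
instruction_framing  = ["neutral", "safety_emphasis"]       # 2 levels (assumed)
surface_rewording    = ["v1", "v2", "v3"]                   # 3 paraphrases (assumed)

judge_prompts = list(product(evaluation_structure, instruction_framing, surface_rewording))
assert len(judge_prompts) == 12  # 2 x 2 x 3 = 12 judge prompt variants

n_target_models = 6
n_behaviors = 400

# Nominal judgment count for a full factorial pass: 12 * 6 * 400 = 28,800.
# The paper reports 28,812 judgments, slightly above this nominal count,
# so the exact protocol evidently includes a few additional judgments.
print(len(judge_prompts) * n_target_models * n_behaviors)
```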

Abstract

Safety benchmarks such as HarmBench rely on LLM judges to classify model responses as harmful or safe, yet the judge configuration, namely the combination of judge model and judge prompt, is typically treated as a fixed implementation detail. We show this assumption is problematic. Using a 2 × 2 × 3 factorial design, we construct 12 judge prompt variants along two axes, evaluation structure and instruction framing, and apply them using a single judge model, Claude Sonnet 4-6, producing 28,812 judgments over six target models and 400 HarmBench behaviors. We find that prompt wording alone, holding the judge model fixed, shifts measured harmful-response rates by up to 24.2 percentage points, with even within-condition surface rewording causing swings of up to 20.1 percentage points. Model safety rankings are moderately unstable, with mean Kendall tau = 0.89, and category-level sensitivity ranges from 39.6 percentage points for copyright to 0 percentage points for harassment. A supplementary multi-judge experiment using three judge models shows that judge-model choice adds further variance. Our results demonstrate that judge prompt wording is a substantial, previously under-examined source of measurement variance in safety benchmarking.
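
For readers unfamiliar with the ranking-stability metric, the minimal sketch below shows how Kendall's tau compares two safety rankings of six target models. The model labels and ranks are invented for illustration only; the paper reports a mean tau of 0.89 across its judge prompt variants.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items.

    rank_a, rank_b: dicts mapping item -> rank position (1 = safest).
    """
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        agreement = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical rankings of six target models under two judge prompt variants.
# A single adjacent swap among six models already drops tau to about 0.87,
# illustrating why a mean tau of 0.89 signals moderately unstable rankings.
prompt_v1 = {"m1": 1, "m2": 2, "m3": 3, "m4": 4, "m5": 5, "m6": 6}
prompt_v2 = {"m1": 1, "m2": 3, "m3": 2, "m4": 4, "m5": 5, "m6": 6}

print(kendall_tau(prompt_v1, prompt_v2))  # ~0.87
```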