An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2

arXiv cs.CL / 4/16/2026


Key Points

  • The paper empirically tests drop-in “LLM-as-a-judge” prompting and aggregation strategies to improve GPT-5.4 judge reliability on RewardBench 2 without any fine-tuning.
  • Two techniques drive most gains: task-specific criteria injection improves accuracy by about +3.0 percentage points at negligible cost, while ensemble scoring improves by about +9.8 points at roughly 5x cost.
  • Using criteria injection plus ensembling together yields 83.6% accuracy, which is +11.9 points over a 71.7% baseline.
  • Additional methods evaluated (calibration context, adaptive model escalation, and soft blending) did not consistently match the improvements of criteria injection and ensembling at comparable cost.
  • Ensembling benefits cheaper model tiers disproportionately, delivering near-flagship accuracy at lower spend (e.g., GPT-5.4 mini at k=8 reaches 79.2% at ~1.2x baseline cost; GPT-5.4 nano at k=8 reaches 71.4% at ~0.4x baseline cost).
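The ensemble scoring described above can be sketched as a simple majority vote over k independent judge calls. The `judge_once` callable here stands in for a single GPT-5.4 judging request (the paper does not specify its aggregation rule; majority voting is an assumption), and `toy_judge` is a purely illustrative stub:

```python
from collections import Counter

def ensemble_judge(judge_once, prompt, candidates, k=8):
    """Aggregate k independent judge calls by majority vote.

    judge_once(prompt, candidates) -> index of the preferred candidate.
    Ties break toward the index seen first by Counter.most_common.
    """
    votes = Counter(judge_once(prompt, candidates) for _ in range(k))
    winner, _ = votes.most_common(1)[0]
    return winner

# Hypothetical stub judge for illustration only:
# it deterministically prefers the longer candidate.
def toy_judge(prompt, candidates):
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))

best = ensemble_judge(toy_judge, "Which reply is better?",
                      ["short", "a longer reply"], k=8)
# With this stub, best is the index of the longer candidate (1).
```

In a real deployment, `judge_once` would sample the judge model at nonzero temperature so the k calls can disagree; with a deterministic judge, ensembling adds cost without changing the verdict.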

Abstract

LLM-as-a-judge, using a language model to score or rank candidate responses, is widely used as a scalable alternative to human evaluation in RLHF pipelines, benchmarking, and application-layer evaluations (evals). However, judgment reliability depends heavily on prompting and aggregation strategy. We present an empirical investigation of practical, drop-in techniques that improve GPT-5.4 judge accuracy on RewardBench 2 without any fine-tuning. Two techniques account for nearly all available gains: task-specific criteria injection (+3.0pp at negligible cost) and ensemble scoring (+9.8pp at 5x cost). Combined, they reach 83.6% accuracy, +11.9pp over the 71.7% baseline. Our investigation also covers three further techniques (calibration context, adaptive model escalation, and soft blending) which did not reliably improve on criteria + ensembling at comparable cost. Cheaper model tiers benefit disproportionately from ensembling: GPT-5.4 mini with k=8 achieves 79.2% at 1.2x baseline cost, and GPT-5.4 nano with k=8 reaches 71.4% at 0.4x baseline cost, making high-accuracy LLM judges accessible at low cost.
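Task-specific criteria injection, the cheapest of the two winning techniques, amounts to prefixing the judge prompt with a rubric chosen for the task type. The paper does not publish its rubric text or task taxonomy, so the criteria strings and task names below are illustrative assumptions, not the authors' prompts:

```python
# Hypothetical per-task rubrics (illustrative only; not from the paper).
CRITERIA = {
    "factuality": "Check every claim for accuracy; penalize unsupported assertions.",
    "instruction_following": "Verify each explicit constraint in the question is satisfied.",
}

def build_judge_prompt(task_type, question, answer_a, answer_b):
    """Prepend a task-specific rubric to a pairwise judging prompt."""
    rubric = CRITERIA.get(task_type, "Judge overall helpfulness and correctness.")
    return (
        f"Evaluation criteria: {rubric}\n\n"
        f"Question: {question}\n"
        f"Response A: {answer_a}\n"
        f"Response B: {answer_b}\n"
        "Which response is better? Answer 'A' or 'B'."
    )
```

Because the injection changes only the prompt string, it adds a handful of input tokens per call, consistent with the paper's "negligible cost" characterization.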