An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2

arXiv cs.CL / 4/16/2026


Key Points

  • The paper empirically tests drop-in “LLM-as-a-judge” prompting and aggregation strategies to improve GPT-5.4 judge reliability on RewardBench 2 without any fine-tuning.
  • Two techniques drive most gains: task-specific criteria injection improves accuracy by about +3.0 percentage points at negligible cost, while ensemble scoring improves by about +9.8 points at roughly 5x cost.
  • Using criteria injection plus ensembling together yields 83.6% accuracy, which is +11.9 points over a 71.7% baseline.
  • Additional methods evaluated (calibration context, adaptive model escalation, and soft blending) did not consistently match the improvements of criteria injection and ensembling at comparable cost.
  • Ensembling benefits cheaper model tiers disproportionately, delivering near-flagship accuracy at lower spend (e.g., GPT-5.4 mini at k=8 reaches 79.2% at ~1.2x baseline cost; GPT-5.4 nano at k=8 reaches 71.4% at ~0.4x baseline cost).
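The ensemble scoring described above can be sketched as a simple majority vote over k independent judge calls. The `judge_once` callable here stands in for a single GPT-5.4 judging request (the paper does not specify its aggregation rule; majority voting is an assumption), and `toy_judge` is a purely illustrative stub:

```python
from collections import Counter

def ensemble_judge(judge_once, prompt, candidates, k=8):
    """Aggregate k independent judge calls by majority vote.

    judge_once(prompt, candidates) -> index of the preferred candidate.
    Ties break toward the index seen first by Counter.most_common.
    """
    votes = Counter(judge_once(prompt, candidates) for _ in range(k))
    winner, _ = votes.most_common(1)[0]
    return winner

# Hypothetical stub judge for illustration only:
# it deterministically prefers the longer candidate.
def toy_judge(prompt, candidates):
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))

best = ensemble_judge(toy_judge, "Which reply is better?",
                      ["short", "a longer reply"], k=8)
# With this stub, best is the index of the longer candidate (1).
```

In a real deployment, `judge_once` would sample the judge model at nonzero temperature so the k calls can disagree; with a deterministic judge, ensembling adds cost without changing the verdict.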

Abstract

LLM-as-a-judge, using a language model to score or rank candidate responses, is widely used as a scalable alternative to human evaluation in RLHF pipelines, benchmarking, and application-layer evaluations (evals). However, judgment reliability depends heavily on prompting and aggregation strategy. We present an empirical investigation of practical, drop-in techniques that improve GPT-5.4 judge accuracy on RewardBench 2 without any fine-tuning. Two techniques account for nearly all available gains: task-specific criteria injection (+3.0pp at negligible cost) and ensemble scoring (+9.8pp at 5x cost). Combined, they reach 83.6% accuracy, +11.9pp over the 71.7% baseline. Our investigation also covers three further techniques (calibration context, adaptive model escalation, and soft blending) which did not reliably improve on criteria + ensembling at comparable cost. Cheaper model tiers benefit disproportionately from ensembling: GPT-5.4 mini with k=8 achieves 79.2% at 1.2x baseline cost, and GPT-5.4 nano with k=8 reaches 71.4% at 0.4x baseline cost, making high-accuracy LLM judges accessible at low cost.
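Task-specific criteria injection, the cheapest of the two winning techniques, amounts to prefixing the judge prompt with a rubric chosen for the task type. The paper does not publish its rubric text or task taxonomy, so the criteria strings and task names below are illustrative assumptions, not the authors' prompts:

```python
# Hypothetical per-task rubrics (illustrative only; not from the paper).
CRITERIA = {
    "factuality": "Check every claim for accuracy; penalize unsupported assertions.",
    "instruction_following": "Verify each explicit constraint in the question is satisfied.",
}

def build_judge_prompt(task_type, question, answer_a, answer_b):
    """Prepend a task-specific rubric to a pairwise judging prompt."""
    rubric = CRITERIA.get(task_type, "Judge overall helpfulness and correctness.")
    return (
        f"Evaluation criteria: {rubric}\n\n"
        f"Question: {question}\n"
        f"Response A: {answer_a}\n"
        f"Response B: {answer_b}\n"
        "Which response is better? Answer 'A' or 'B'."
    )
```

Because the injection changes only the prompt string, it adds a handful of input tokens per call, consistent with the paper's "negligible cost" characterization.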