An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2
arXiv cs.CL / 4/16/2026
Key Points
- The paper empirically tests drop-in “LLM-as-a-judge” prompting and aggregation strategies to improve GPT-5.4 judge reliability on RewardBench 2 without any fine-tuning.
- Two techniques drive most gains: task-specific criteria injection improves accuracy by about +3.0 percentage points at negligible cost, while ensemble scoring improves it by about +9.8 points at roughly 5x cost (a minimal sketch of both appears after this list).
- Using criteria injection plus ensembling together yields 83.6% accuracy, which is +11.9 points over a 71.7% baseline.
- Additional methods evaluated (calibration context, adaptive model escalation, and soft blending) did not consistently match the improvements of criteria injection and ensembling at comparable cost.
- Ensembling disproportionately benefits cheaper model tiers, since even k=8 small-tier calls cost roughly as much as, or less than, a single full-size baseline call, enabling accuracy close to the full-size judge at much lower spend (e.g., GPT-5.4 mini k=8 at 79.2% with ~1.2x baseline cost; GPT-5.4 nano k=8 at 71.4% with ~0.4x baseline cost).
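
For a concrete picture of the two winning techniques, here is a minimal sketch of criteria injection combined with k-sample ensemble scoring. The rubric strings, function names, and the `call_judge` stub are hypothetical illustrations, not the paper's actual prompts or aggregation code; the paper's exact criteria and voting rule may differ.

```python
from collections import Counter

# Hypothetical task-specific rubrics. The paper's actual criteria text is not
# reproduced here; these strings only illustrate the "criteria injection" idea.
CRITERIA = {
    "factuality": "Prefer the response whose claims are verifiably correct.",
    "instruction_following": "Prefer the response that satisfies every explicit constraint in the prompt.",
}

def build_judge_prompt(task_type: str, prompt: str, response_a: str, response_b: str) -> str:
    """Criteria injection: prepend a task-specific rubric to a standard pairwise judge prompt."""
    rubric = CRITERIA.get(task_type, "Prefer the more helpful, harmless, and honest response.")
    return (
        f"Evaluation criteria: {rubric}\n\n"
        f"User prompt:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Answer with exactly 'A' or 'B'."
    )

def call_judge(judge_prompt: str) -> str:
    """Placeholder for one LLM judge call; swap in your provider's chat client here."""
    raise NotImplementedError("wire up an LLM client that returns 'A' or 'B'")

def ensemble_verdict(task_type: str, prompt: str, response_a: str, response_b: str, k: int = 8) -> str:
    """Ensemble scoring: sample k independent judge verdicts and return the majority vote."""
    judge_prompt = build_judge_prompt(task_type, prompt, response_a, response_b)
    votes = [call_judge(judge_prompt) for _ in range(k)]
    return Counter(votes).most_common(1)[0][0]
```

Majority voting over k independent samples is the simplest aggregation rule consistent with the k=8 settings quoted above; replacing the `call_judge` stub with a real client call is all that is needed to try it.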