Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization

arXiv cs.CL · April 23, 2026

📰 News · Models & Research

Key Points

  • The paper studies how prompt design and the choice of “judge” LLM affect LLM-as-a-Judge evaluations for free-text legal QA on the LEXam benchmark.
  • Using the ProTeGi method with feedback from two judges (Qwen3-32B and DeepSeek-V3) across four task models, automatic prompt optimization consistently beats human-centered baseline prompts.
  • Lenient judge feedback produces larger, more consistent improvements than strict judge feedback, and prompts optimized with lenient feedback transfer better to strict judges.
  • The analysis suggests lenient judges give more permissive feedback that yields broadly applicable prompts, while strict judges drive restrictive, judge-specific overfitting.
  • The authors conclude that algorithmically optimizing prompts on training data can outperform manual prompt design, and that judge disposition critically influences generalizability, with code and optimized prompts released on GitHub.
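The optimization loop described above follows the general ProTeGi pattern: score a task prompt on training examples, ask a judge LLM for a textual critique of the failures, rewrite the prompt in the direction of that critique, and keep the best candidate. A minimal sketch of that pattern, using hypothetical stub functions in place of the real task model and judge LLMs (the paper's actual setup uses Qwen3-32B and DeepSeek-V3 as judges; none of the function names below come from the paper's code):

```python
# Sketch of a ProTeGi-style prompt-optimization step. All functions here are
# illustrative stand-ins; real implementations would call the task model and
# a (lenient or strict) judge LLM instead.

def task_model(prompt: str, question: str) -> str:
    # Stand-in for the model being optimized.
    return f"Answer to '{question}' under prompt of length {len(prompt)}"

def judge_score(answer: str, reference: str) -> float:
    # Toy scoring: word overlap with the reference answer. A real judge
    # LLM would grade free-text legal answers instead.
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / max(len(r), 1)

def judge_feedback(prompt: str, failures: list) -> str:
    # A real judge would return a textual critique (the "gradient")
    # describing what the failing answers lacked.
    return "Cite the governing rule and justify each step."

def edit_prompt(prompt: str, critique: str) -> str:
    # A real implementation would ask an LLM to rewrite the prompt
    # according to the critique; here we just append it.
    return prompt + " " + critique

def protegi_step(prompt: str, train_set: list, beam_width: int = 2) -> str:
    """One iteration: score, collect failures, critique, expand, keep best."""
    scored = [(judge_score(task_model(prompt, q), ref), q, ref)
              for q, ref in train_set]
    failures = [(q, ref) for s, q, ref in scored if s < 0.5]
    candidates = [prompt] + [
        edit_prompt(prompt, judge_feedback(prompt, failures))
        for _ in range(beam_width)
    ]
    def avg(p: str) -> float:
        return sum(judge_score(task_model(p, q), ref)
                   for q, ref in train_set) / len(train_set)
    return max(candidates, key=avg)

train = [("What is consideration in contract law?",
          "Consideration is something of value exchanged between parties.")]
best = protegi_step("Answer the legal question.", train)
```

The paper's central variable sits in `judge_feedback`: a lenient judge returns permissive critiques that yield broadly applicable prompt edits, while a strict judge returns restrictive critiques that overfit the prompt to that judge.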

Abstract

This work explores the role of prompt design and judge selection in LLM-as-a-Judge evaluations of free-text legal question answering. We examine whether automatic task-prompt optimization improves over human-centered design, whether optimization effectiveness varies by judge feedback style, and whether optimized prompts transfer across judges. We systematically address these questions on the LEXam benchmark by optimizing task prompts using the ProTeGi method with feedback from two judges (Qwen3-32B, DeepSeek-V3) across four task models, and then testing cross-judge transfer. Automatic optimization consistently outperforms the baseline, with lenient judge feedback yielding higher and more consistent gains than strict judge feedback. Prompts optimized with lenient feedback transfer better to strict judges than the reverse direction. Analysis reveals that lenient judges provide permissive feedback, yielding prompts with broader applicability, whereas strict judges produce restrictive feedback, leading to judge-specific overfitting. Our findings demonstrate that algorithmically optimizing prompts on training data can outperform human-centered prompt design and that judges' dispositions during optimization shape prompt generalizability. Code and optimized prompts are available at https://github.com/TUMLegalTech/icail2026-llm-judge-gaming.