Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge

arXiv cs.CL / 3/6/2026


Key Points

The paper evaluates the score stability of LLM-as-a-judge evaluation across five popular models, using real enterprise RAG QA pairs.
It finds substantial score variability across repeated runs, with “completeness” scoring fluctuating the most, even at temperature=0.
Cross-model comparisons show systematic differences in strictness and interpretation, producing divergent scores for identical inputs.
Lower temperature improves stability for some models (notably GPT-4o and Gemini) but has limited or inconsistent effects for Anthropic models.
The results warn that production workflows using LLM scores for routing, gating, or QC face risks to fairness, reproducibility, and operational reliability, motivating monitoring and hybrid human-LLM evaluation.
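
As a rough illustration of the kind of monitoring these points motivate, the sketch below measures two things: how much a judge's score for the same item varies across repeated runs, and how far each model's average score sits from the cross-model average (a simple proxy for strictness). This is not the paper's methodology. The function names, the `judge` callable, and the stub judge are hypothetical stand-ins; in practice `judge` would wrap a real LLM call.

```python
import itertools
from statistics import mean, pstdev
from typing import Callable, Dict, Iterable, List


def run_to_run_spread(
    judge: Callable[[str], float], items: Iterable[str], runs: int = 5
) -> List[float]:
    """For each item, score it `runs` times and return the population
    standard deviation of those scores (0.0 means perfectly stable)."""
    spreads = []
    for item in items:
        scores = [judge(item) for _ in range(runs)]
        spreads.append(pstdev(scores))
    return spreads


def strictness_gap(scores_by_model: Dict[str, List[float]]) -> Dict[str, float]:
    """Return each model's mean score minus the mean of all model means.
    Negative values suggest a stricter-than-average judge."""
    model_means = {m: mean(s) for m, s in scores_by_model.items()}
    grand_mean = mean(model_means.values())
    return {m: mu - grand_mean for m, mu in model_means.items()}


def make_stub_judge(sequence: List[float]) -> Callable[[str], float]:
    """Hypothetical stand-in for an LLM judge: replays a fixed score
    sequence so the example is deterministic."""
    cycle = itertools.cycle(sequence)
    return lambda item: next(cycle)


if __name__ == "__main__":
    unstable = make_stub_judge([3, 5])  # alternates between two scores
    print(run_to_run_spread(unstable, ["qa-1"], runs=4))
    print(strictness_gap({"model_a": [4, 4], "model_b": [2, 2]}))
```

A production check might flag any item whose spread exceeds a chosen threshold and send it to human review, which is one way to realize the hybrid human-LLM evaluation the summary mentions.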
