Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge

arXiv cs.CL / 3/6/2026


Key Points

  • The paper evaluates score stability of LLM-as-a-judge across five popular models using real enterprise RAG QA pairs.
  • It finds substantial score variability across repeated runs, with “completeness” scoring fluctuating the most, even at temperature=0.
  • Cross-model comparisons show systematic differences in strictness and interpretation, producing divergent scores for identical inputs.
  • Lower temperature improves stability for some models (notably GPT-4o and Gemini) but has limited or inconsistent effects for Anthropic models.
  • The results warn that production workflows using LLM scores for routing, gating, or QC face risks to fairness, reproducibility, and operational reliability, motivating monitoring and hybrid human-LLM evaluation.
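
The repeated-run variability described in the points above can be checked with a simple monitoring step. The sketch below is illustrative only and is not the paper's method: it assumes you have already collected several judge scores for the same input under each criterion, and the function names (`stability_report`, `least_stable`), the criterion names, and the scores are all hypothetical.

```python
from statistics import mean, pstdev


def stability_report(scores_by_criterion):
    """Summarize how much repeated judge scores vary for a single input.

    scores_by_criterion maps a criterion name to the list of scores the
    judge returned across repeated runs on the same input.
    """
    report = {}
    for criterion, scores in scores_by_criterion.items():
        report[criterion] = {
            "mean": mean(scores),
            "stdev": pstdev(scores),              # population standard deviation
            "range": max(scores) - min(scores),   # worst-case disagreement
        }
    return report


def least_stable(report):
    """Return the criterion whose scores vary the most across runs."""
    return max(report, key=lambda name: report[name]["stdev"])


# Hypothetical scores from five repeated runs on one QA pair (made-up data).
repeated_runs = {
    "relevance": [4, 4, 4, 4, 4],
    "completeness": [3, 5, 2, 4, 3],
}

report = stability_report(repeated_runs)
for name, stats in report.items():
    print(name, stats)
print("least stable:", least_stable(report))
```

Running the same summary per model on identical inputs would give a rough view of the cross-model differences in strictness, and a threshold on `stdev` or `range` could flag inputs for human review before a score is used for routing or gating.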
