Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge
arXiv cs.CL / 3/6/2026
Key Points
- The paper evaluates score stability of LLM-as-a-judge across five popular models using real enterprise RAG QA pairs.
- It finds substantial score variability across repeated runs, with “completeness” scoring fluctuating the most, even at temperature=0.
- Cross-model comparisons show systematic differences in strictness and interpretation, producing divergent scores for identical inputs.
- Lower temperature improves stability for some models (notably GPT-4o and Gemini) but has limited or inconsistent effects for Anthropic models.
- The results warn that production workflows that use LLM scores for routing, gating, or quality control face risks to fairness, reproducibility, and operational reliability, which motivates monitoring and hybrid human-LLM evaluation.
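
The two measurements the key points describe, run-to-run variability for a single judge and strictness differences between judges, can be illustrated with a minimal sketch. This is not the paper's methodology: the function names (`repeat_judge`, `score_spread`, `strictness_gap`), the numeric score scale, and the stub judge are all assumptions made for illustration, and a real setup would replace the stub with a call to an actual judge model.

```python
import statistics
from itertools import cycle
from typing import Callable, Dict, List


def repeat_judge(judge: Callable[[str, str], float],
                 question: str, answer: str, runs: int = 5) -> List[float]:
    """Score the same (question, answer) pair several times with one judge."""
    return [judge(question, answer) for _ in range(runs)]


def score_spread(scores: List[float]) -> Dict[str, float]:
    """Summarize how much repeated scores for one input vary."""
    return {
        "mean": statistics.fmean(scores),
        "stdev": statistics.pstdev(scores),   # population std. dev. of the runs
        "range": max(scores) - min(scores),   # worst-case disagreement
    }


def strictness_gap(scores_by_model: Dict[str, List[float]]) -> Dict[str, float]:
    """Each model's mean score minus the average of all model means.

    Negative values mark judges that score the same inputs more harshly.
    """
    means = {m: statistics.fmean(s) for m, s in scores_by_model.items()}
    overall = statistics.fmean(list(means.values()))
    return {m: mu - overall for m, mu in means.items()}


# Hypothetical stand-in for a real judge call: it cycles through fixed
# scores to imitate a judge that drifts between runs on identical input.
_fake_scores = cycle([4.0, 5.0, 3.0, 4.0, 4.0])


def stub_judge(question: str, answer: str) -> float:
    return next(_fake_scores)


runs = repeat_judge(stub_judge, "What is the refund window?", "30 days.", runs=5)
spread = score_spread(runs)
gaps = strictness_gap({"model_a": [4.0, 4.0, 4.0], "model_b": [2.0, 2.0, 2.0]})
```

In a monitoring setup along these lines, an input whose `range` exceeds a chosen tolerance could be flagged for human review rather than routed on the raw score; the tolerance itself would have to be chosen per rubric and per judge.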