Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge
arXiv cs.CL / 3/6/2026
Key Points
- The paper evaluates score stability of LLM-as-a-judge across five popular models using real enterprise RAG QA pairs.
- It finds substantial score variability across repeated runs, with “completeness” scoring fluctuating the most, even at temperature=0.
- Cross-model comparisons show systematic differences in strictness and interpretation, producing divergent scores for identical inputs.
- Lower temperature improves stability for some models (notably GPT-4o and Gemini) but has limited or inconsistent effects for Anthropic models.
- The results warn that production workflows using LLM scores for routing, gating, or QC face risks to fairness, reproducibility, and operational reliability, motivating monitoring and hybrid human-LLM evaluation.
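As a rough illustration of the kind of stability check these points motivate, the sketch below computes per-item score spread across repeated judge runs and the share of items whose pass/fail decision flips at a gating threshold. It is not the paper's methodology: the function names, the threshold, and the score values are all hypothetical and chosen only for demonstration.

```python
from statistics import mean, pstdev


def run_to_run_spread(scores_by_item):
    """Population standard deviation of repeated judge scores, per item."""
    return {item: pstdev(scores) for item, scores in scores_by_item.items()}


def flip_rate(scores_by_item, threshold):
    """Fraction of items whose pass/fail decision changes across repeated runs."""
    flipped = 0
    for scores in scores_by_item.values():
        # Collect the distinct pass/fail outcomes seen for this item.
        decisions = {score >= threshold for score in scores}
        if len(decisions) > 1:
            flipped += 1
    return flipped / len(scores_by_item)


# Hypothetical scores: the same QA pair judged four times by one model.
repeated_scores = {
    "qa_1": [4, 4, 4, 4],
    "qa_2": [3, 4, 2, 4],
    "qa_3": [5, 4, 5, 5],
}

spread = run_to_run_spread(repeated_scores)
mean_spread = mean(spread.values())
gate_flip_rate = flip_rate(repeated_scores, threshold=4)

print(spread)
print(f"mean spread: {mean_spread:.3f}")
print(f"gate flip rate: {gate_flip_rate:.2f}")
```

Tracking a flip rate alongside raw spread may be useful when scores feed routing or gating, since a small variance can still change a decision for items that sit near the threshold.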