Adaptive Rigor in AI System Evaluation using Temperature-Controlled Verdict Aggregation via Generalized Power Mean

arXiv cs.AI / 4/13/2026


Key Points

  • The paper argues that common LLM evaluation approaches (e.g., LLM-as-a-judge, verdict systems, and natural language inference (NLI)) may misalign with human judgments because they apply a fixed level of strictness across all domains.
  • It introduces Temperature-Controlled Verdict Aggregation (TCVA), which aggregates a five-level verdict score using generalized power-mean pooling and a temperature parameter T (0.1–1.0) to tune evaluation rigor.
  • The authors position low temperatures as producing pessimistic, safety-critical-friendly scoring, while higher temperatures yield more lenient assessments for conversational or user-facing settings.
  • Experiments on SummEval and USR with human Likert annotations show TCVA correlates with human judgments at a level comparable to RAGAS for faithfulness (Spearman 0.667 vs. 0.676) and outperforms DeepEval.
  • A key efficiency claim is that TCVA can adjust the temperature parameter without requiring additional LLM calls, reducing evaluation cost when tuning strictness.
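To make the aggregation mechanism concrete, here is a minimal sketch of temperature-controlled power-mean pooling. The paper does not specify how T maps to the power-mean exponent p, so the linear mapping below (T = 0.1 → p = −8, T = 1.0 → p = 1, i.e. the arithmetic mean) and the verdict-to-score values are purely illustrative assumptions:

```python
import math

def power_mean(scores, p, eps=1e-9):
    """Generalized power mean M_p = (mean(x_i^p))^(1/p).
    As p -> -inf the result approaches min(scores) (pessimistic);
    p = 1 gives the plain arithmetic mean (lenient)."""
    if abs(p) < eps:  # the p -> 0 limit is the geometric mean
        return math.exp(sum(math.log(max(s, eps)) for s in scores) / len(scores))
    return (sum(max(s, eps) ** p for s in scores) / len(scores)) ** (1.0 / p)

def tcva_score(verdict_scores, temperature):
    """Aggregate per-claim verdict scores with a temperature-tuned exponent.
    The mapping T -> p below is an assumption for illustration, not the
    paper's exact formula."""
    assert 0.1 <= temperature <= 1.0
    p = 10.0 * temperature - 9.0  # T=0.1 -> p=-8 (strict), T=1.0 -> p=1 (mean)
    return power_mean(verdict_scores, p)

# Hypothetical five-level verdicts already mapped onto (0, 1] scores.
scores = [1.0, 0.8, 0.3]
strict = tcva_score(scores, 0.1)   # dominated by the worst verdict
lenient = tcva_score(scores, 1.0)  # plain average, 0.7
```

This also illustrates the efficiency claim: the per-claim verdicts come from LLM calls once, and changing T only re-runs the cheap arithmetic aggregation, with no further LLM calls.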

Abstract

Existing evaluation methods for LLM-based AI systems, such as LLM-as-a-Judge, verdict systems, and NLI, do not always align well with human assessment because they cannot adapt their strictness to the application domain. This paper presents Temperature-Controlled Verdict Aggregation (TCVA), a method that combines a five-level verdict-scoring system with generalized power-mean aggregation and an intuitive temperature parameter T ∈ [0.1, 1.0] to control evaluation rigor. Low temperatures yield pessimistic scores suited for safety-critical domains; high temperatures produce lenient scores appropriate for conversational AI. Experimental evaluation on benchmark datasets with human Likert-scale annotations (SummEval and USR) shows that TCVA achieves correlation with human judgments comparable to RAGAS on faithfulness (Spearman ρ = 0.667 vs. 0.676) while consistently outperforming DeepEval. The method requires no additional LLM calls when adjusting the temperature parameter.