Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation

arXiv cs.AI / 4/2/2026

Key Points

  • The paper studies LLM-based agent judges for evaluating conversational AI by running 960 sessions across two model pairs and 15 tasks, comparing agent-judge outputs against human raters via a Turing-style validation.
  • Results show persona-based agent judges can produce assessments indistinguishable from those of human raters in a Turing-style validation, which speaks to the first open question: whether agent-judge assessments can be trusted.
  • It finds a score–coverage dissociation: quality scores improve logarithmically with panel size while unique issue discoveries follow a sublinear power law; both show diminishing returns, but scores saturate roughly twice as fast as discoveries.
  • The authors hypothesize this scaling behavior reflects a power-law distribution of the “finding space,” where critical issues are found early by small panels and rarer corner cases require larger panels.
  • The mechanism is attributed to ensemble diversity from structured Big Five personality conditioning, with expert judges functioning as adversarial probes; an ablation indicates that structured persona conditioning (not mere prompting) is necessary to reproduce the observed scaling properties.
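
The two scaling forms named in the key points can be compared side by side. The sketch below is illustrative only: the functional forms (logarithmic for scores, sublinear power law for discoveries) come from the summary, but the constants A, B, C and BETA are hypothetical values chosen for the example, not figures reported by the paper.

```python
import math

# Hypothetical parameters, chosen only for illustration (not from the paper).
A, B = 3.0, 0.5      # score(n) = A + B * ln(n)
C, BETA = 4.0, 0.6   # discoveries(n) = C * n**BETA, with 0 < BETA < 1


def score(n: int) -> float:
    """Logarithmic quality score for a panel of n judges."""
    return A + B * math.log(n)


def discoveries(n: int) -> float:
    """Sublinear power-law count of unique issues for a panel of n judges."""
    return C * n ** BETA


def marginal(f, n: int) -> float:
    """Gain from adding one more judge to a panel of size n."""
    return f(n + 1) - f(n)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(
            f"n={n:2d}  score={score(n):.2f} (+{marginal(score, n):.3f})  "
            f"discoveries={discoveries(n):.2f} (+{marginal(discoveries, n):.3f})"
        )
```

Under these forms, doubling the panel adds a fixed increment of B * ln(2) to the score, whereas it multiplies the discovery count by 2**BETA. Both curves flatten, but the power law keeps paying out at panel sizes where the logarithm has nearly stalled.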

Abstract

LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed? Through 960 sessions with two model pairs across 15 tasks, we show that persona-based agent judges produce evaluations indistinguishable from human raters in a Turing-style validation. We then identify a score-coverage dissociation: quality scores improve logarithmically with panel size, while unique issue discoveries follow a sublinear power law. Both exhibit diminishing returns, but scores saturate roughly twice as fast as discoveries. We hypothesize this reflects a power-law distribution of the finding space: critical issues are discovered first by small panels, while corner cases require progressively larger panels, analogous to species accumulation curves in ecology. The mechanism traces to ensemble diversity: Big Five personality conditioning makes agents probe different quality dimensions, with expert judges acting as adversarial probes that push discovery into the tail of the finding distribution. A controlled ablation confirms that structured persona conditioning, not simple prompting, is required to produce these scaling properties.
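
The finding-space hypothesis and its species-accumulation analogy can be made concrete with a toy model. This is a sketch under strong assumptions, not the authors' method: each issue is assumed to be detected independently by every judge with a fixed probability that decays as a power of the issue's rank, and all judges are treated as identical (which ignores the persona diversity the abstract credits for the effect). The function names and the parameter values (200 issues, exponent 1.0, top probability 0.8) are hypothetical.

```python
def detection_probs(num_issues: int, exponent: float, p_max: float) -> list[float]:
    """Per-judge detection probability for each issue, decaying with rank.

    Rank 1 is the most conspicuous (critical) issue; high ranks are corner cases.
    """
    return [p_max * rank ** -exponent for rank in range(1, num_issues + 1)]


def expected_unique(probs: list[float], n: int) -> float:
    """Expected number of distinct issues found by n independent judges.

    An issue with per-judge probability p is missed by all n judges with
    probability (1 - p)**n, so it is found with probability 1 - (1 - p)**n.
    """
    return sum(1.0 - (1.0 - p) ** n for p in probs)


def found_fraction(probs: list[float], n: int) -> float:
    """Expected share of the given issues found by a panel of size n."""
    return expected_unique(probs, n) / len(probs)


if __name__ == "__main__":
    probs = detection_probs(num_issues=200, exponent=1.0, p_max=0.8)
    head = probs[:5]  # the five most conspicuous issues
    for n in (1, 2, 5, 10, 20, 40):
        print(
            f"panel={n:2d}  expected unique={expected_unique(probs, n):6.2f}  "
            f"head found={found_fraction(head, n):.2f}  "
            f"all found={found_fraction(probs, n):.2f}"
        )
```

In this toy model the most conspicuous issues are nearly exhausted by a small panel, while the total count keeps growing slowly as larger panels reach into the low-probability tail, which is the qualitative pattern the abstract describes.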