Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation
arXiv cs.AI / 4/2/2026
Key Points
- The paper studies LLM-based agent judges for evaluating conversational AI by running 960 sessions across two model pairs and 15 tasks, comparing agent-judge outputs against human raters via a Turing-style validation.
- Results show persona-based agent judges can produce assessments statistically indistinguishable from human evaluations, partially answering the open question of whether LLM judges can be trusted as substitutes for human raters.
- It finds a score–coverage dissociation: quality scores improve logarithmically with panel size while unique issue discoveries follow a sublinear power law, with scoring saturating faster than coverage.
- The authors hypothesize this scaling behavior reflects a power-law distribution of the “finding space,” where critical issues are found early by small panels and rarer corner cases require larger panels.
- The mechanism is attributed to ensemble diversity from structured Big Five personality conditioning, with expert judges functioning as adversarial probes; an ablation indicates that structured persona conditioning (not mere prompting) is necessary to reproduce the observed scaling properties.