Measuring all the noises of LLM Evals
arXiv stat.ML / 3/31/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that separating signal from noise in LLM experiments requires statistical methods adapted to LLM-specific “noise” behavior in evaluation settings.
- It defines and measures three noise components in LLM evals: prediction noise (variance across different generated answers for the same question), data noise (variance from sampling different questions), and their combined total noise using the law of total variance.
- It introduces the “all-pairs paired” evaluation method, which runs paired comparisons across all model pairs and decomposes the noise components using millions of question-level predictions across many evals and settings.
- Results show that each eval has a characteristic, highly predictable total noise level, and that paired prediction noise usually exceeds paired data noise; averaging multiple predictions per question can therefore materially improve statistical power.
- By measuring all noise components together, the approach helps interpret eval outcomes in context and supports more sound empirical decision-making when selecting or comparing LLMs.
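The decomposition the paper relies on can be illustrated with a small simulation. This is a sketch under assumed toy data, not the paper's actual pipeline: one model answers each question several times, scores are 0/1 correctness, and the law of total variance splits the pooled variance into a within-question term (prediction noise) and a between-question term (data noise). The question difficulties and sample sizes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative, not the paper's data): one model answers each of
# n_questions questions n_resamples times; scores are 0/1 correctness.
n_questions, n_resamples = 1000, 16

# Each question i has a latent pass rate p_i; resampled answers are Bernoulli(p_i).
p = rng.beta(2, 2, size=n_questions)
scores = rng.binomial(1, p[:, None], size=(n_questions, n_resamples))

# Law of total variance over the question index Q:
#   Var(score) = E[Var(score | Q)] + Var(E[score | Q])
#              = prediction noise   + data noise
prediction_noise = scores.var(axis=1, ddof=0).mean()  # mean within-question variance
data_noise = scores.mean(axis=1).var(ddof=0)          # variance of per-question means
total_noise = scores.var(ddof=0)                      # pooled variance over all scores

# With population (ddof=0) estimates on a full question-by-resample grid,
# the decomposition is exact.
assert np.isclose(prediction_noise + data_noise, total_noise)
print(f"prediction noise: {prediction_noise:.4f}")
print(f"data noise:       {data_noise:.4f}")
print(f"total noise:      {total_noise:.4f}")
```

The same split explains the practical takeaway above: averaging the `n_resamples` answers per question shrinks only the prediction-noise contribution, so when that term dominates, cheap resampling buys real statistical power.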