Measuring all the noises of LLM Evals

arXiv stat.ML · March 31, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that separating signal from noise in LLM experiments requires statistical methods adapted to the noise characteristics specific to LLM evaluations.
  • It defines and measures three noise components in LLM evals: prediction noise (variance across different generated answers for the same question), data noise (variance from sampling different questions), and their combined total noise using the law of total variance.
  • It introduces the “all-pairs paired” evaluation method, which runs paired comparisons across all model pairs and decomposes the noise components using millions of question-level predictions across many evals and settings.
  • Results show each eval has a characteristic, highly predictable total noise level, and that paired prediction noise usually exceeds paired data noise, implying that averaging multiple predictions per question can materially improve statistical power.
  • By measuring all noise components together, the approach helps interpret eval outcomes in context and supports more sound empirical decision-making when selecting or comparing LLMs.
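The decomposition in the second point can be sketched numerically. In a minimal sketch (all question counts, sample counts, and distributions below are illustrative assumptions, not figures from the paper), each question has its own probability of a correct answer; the within-question variance is the prediction noise, the between-question variance of the per-question means is the data noise, and by the law of total variance they sum exactly to the total variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical eval: 200 questions, each with its own probability that the
# model answers it correctly (illustrative values, not from the paper).
n_questions, n_samples = 200, 50
p = rng.beta(2, 2, size=n_questions)  # per-question success rates

# n_samples generated answers per question, scored 1 (correct) or 0 (incorrect)
scores = rng.binomial(1, p[:, None], size=(n_questions, n_samples))

# Law of total variance: Var(score) = E[Var(score | q)] + Var(E[score | q])
prediction_noise = scores.var(axis=1, ddof=0).mean()  # within-question variance
data_noise = scores.mean(axis=1).var(ddof=0)          # between-question variance
total_noise = scores.var(ddof=0)                      # overall variance

print(f"prediction noise: {prediction_noise:.4f}")
print(f"data noise:       {data_noise:.4f}")
print(f"total noise:      {total_noise:.4f}")
```

With equal numbers of samples per question and population variances (`ddof=0`), the identity `prediction_noise + data_noise == total_noise` holds exactly, which is the ANOVA form of the law of total variance.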

Abstract

Separating signal from noise is central to experiments. Applying well-established statistical methods effectively to LLM evals requires consideration of their unique noise characteristics. We clearly define and measure three types of noise: prediction noise from generating different answers on a given question, data noise from sampling questions, and their combined total noise following the law of total variance. To emphasize relative comparisons and gain statistical power, we propose the all-pairs paired method, which applies the paired analysis to all pairs of LLMs and measures all the noise components based on millions of question-level predictions across many evals and settings, revealing clear patterns. First, each eval exhibits a characteristic and highly predictable total noise level across all model pairs. Second, paired prediction noise typically exceeds paired data noise, which means reducing prediction noise by averaging can significantly increase statistical power. By measuring all the noises together, we can assess eval results in context, lowering the barrier to using the best analysis to make sound empirical decisions.
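The paired analysis the abstract describes can be illustrated with a small simulation. In this sketch (model names, accuracy shifts, and sizes are hypothetical assumptions for illustration), every pair of models is scored on the same questions, and the statistic of interest is the per-question difference in mean score; the paired standard error then reflects both paired data noise and the prediction noise that averaging over repeated samples per question shrinks:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Hypothetical per-question accuracies for three models on the same questions.
n_q, n_samples = 300, 20
base = rng.beta(2, 2, size=n_q)
models = {name: np.clip(base + shift, 0.0, 1.0)
          for name, shift in [("A", 0.00), ("B", 0.05), ("C", 0.10)]}

# n_samples sampled answers per question per model (1 = correct, 0 = incorrect)
samples = {m: rng.binomial(1, p[:, None], size=(n_q, n_samples))
           for m, p in models.items()}

# All-pairs paired comparison: same questions, per-question score differences,
# with prediction noise reduced by averaging n_samples answers per question.
for m1, m2 in combinations(samples, 2):
    diff = samples[m1].mean(axis=1) - samples[m2].mean(axis=1)
    se = diff.std(ddof=1) / np.sqrt(n_q)  # paired standard error of the mean diff
    print(f"{m1} vs {m2}: mean diff {diff.mean():+.3f}, SE {se:.4f}")
```

Because both models are scored on the same questions, question difficulty cancels in `diff`, which is why the paired design gains statistical power over comparing two independent accuracy estimates.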