AI benchmarks systematically ignore how humans disagree, Google study finds

THE DECODER / 4/5/2026


Key Points

  • A Google study argues that the common benchmark practice of using only three to five human raters per example can produce unreliable results because it fails to capture the variability in human judgment (see the sketch after this list).
  • The research finds that how teams split their annotation budget across items and raters can matter as much as the total number of annotations collected.
  • The study highlights that benchmark scores may be systematically biased when human disagreement is treated as noise rather than an informative signal.
  • It implies that future benchmark design should account for rater disagreement and uncertainty to improve comparability and robustness across models.
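
To see why three to five raters can be too few, here is a toy Monte Carlo sketch (our illustration with made-up numbers, not code or data from the study). Each test item gets a true approval probability, simulated raters vote accordingly, and a majority vote decides whether the item counts as passed:

```python
import random
import statistics

rng = random.Random(0)
NUM_ITEMS = 200
# Fixed pool of test items. Each item's "true" approval probability is
# the fraction of the rater population that would accept the model's
# output; values away from 0 and 1 encode genuine human disagreement.
TRUE_P = [rng.uniform(0.4, 0.9) for _ in range(NUM_ITEMS)]

def benchmark_score(raters_per_item):
    """One benchmark run: sample votes for every item, majority-vote
    each item into pass/fail, return the fraction of items passed."""
    passed = 0
    for p in TRUE_P:
        votes = sum(rng.random() < p for _ in range(raters_per_item))
        if votes * 2 > raters_per_item:
            passed += 1
    return passed / NUM_ITEMS

for k in (3, 5, 15):
    runs = [benchmark_score(k) for _ in range(500)]
    print(f"{k:>2} raters/item: mean={statistics.mean(runs):.3f}, "
          f"stdev={statistics.stdev(runs):.3f}")
```

Two things happen in this toy setup: the run-to-run spread shrinks as raters per item increase, and the mean score itself drifts upward with the rater count, because majority voting throws the disagreement away instead of reporting it. The simulated raters approve about 65 percent of outputs on average, yet the majority-vote score climbs toward 80 percent as raters are added, the kind of systematic bias the key points describe.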

A Google study finds that the standard three to five human raters per test example often aren't enough for reliable AI benchmarks, and that how the annotation budget is split between items and raters matters just as much as the budget itself.
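
To make the budget-splitting point concrete, here is a second toy sketch (again our own illustration with arbitrary numbers, not the study's setup). It holds the total annotation budget fixed at 1,800 votes and compares three ways of dividing it between items and raters, scoring each run by the raw mean approval rate:

```python
import random
import statistics

rng = random.Random(1)
BUDGET = 1800  # total number of annotations we can afford

def score_once(num_items, raters_per_item):
    """One benchmark run under a fixed annotation budget: sample
    fresh items, collect all votes, return the mean approval rate."""
    votes = []
    for _ in range(num_items):
        p = rng.uniform(0.4, 0.9)  # item's true approval probability
        votes += [rng.random() < p for _ in range(raters_per_item)]
    return sum(votes) / len(votes)

# Same total budget, three different splits between items and raters.
for num_items, k in [(1800, 1), (600, 3), (180, 10)]:
    assert num_items * k == BUDGET
    runs = [score_once(num_items, k) for _ in range(500)]
    print(f"{num_items:>4} items x {k:>2} raters each: "
          f"stdev={statistics.stdev(runs):.4f}")
```

In this particular setup, spreading the budget across more items yields a more stable aggregate score than piling raters onto fewer items. With other goals, such as estimating how much raters disagree on each individual item, the trade-off can flip, which is why the allocation itself deserves as much design attention as the budget's size.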
