Position: Science of AI Evaluation Requires Item-level Benchmark Data

arXiv cs.AI / 4/7/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The paper argues that today’s AI evaluation methods often suffer from systemic validity failures, including flawed benchmark design choices and poorly aligned metrics.
  • It presents a case that collecting item-level benchmark data is necessary to build a more rigorous “science of AI evaluation,” enabling fine-grained diagnostics and principled benchmark validation.
  • The authors review evaluation paradigms by drawing connections between computer science evaluation practice and psychometrics, showing how item-level evidence can expose validity issues that aggregate scores obscure.
  • They introduce OpenEval, a growing repository intended to support community adoption of evidence-centered, item-level AI evaluation workflows.

Abstract

AI evaluations have become the primary evidence for deploying generative AI systems across high-stakes domains. However, current evaluation paradigms often exhibit systemic validity failures. These issues, ranging from unjustified design choices to misaligned metrics, remain intractable without a principled framework for gathering validity evidence and conducting granular diagnostic analysis. In this position paper, we argue that item-level AI benchmark data is essential for establishing a rigorous science of AI evaluation. Item-level analysis enables fine-grained diagnostics and principled validation of benchmarks. We substantiate this position by dissecting current validity failures and revisiting evaluation paradigms across computer science and psychometrics. Through illustrative analyses of item properties and latent constructs, we demonstrate the unique insights afforded by item-level data. To catalyze community-wide adoption, we introduce OpenEval, a growing repository of item-level benchmark data designed to support evidence-centered AI evaluation.
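
To make the idea of "item properties" concrete, here is a minimal sketch of the kind of analysis that item-level data enables and aggregate scores cannot: classical item statistics (per-item difficulty and item-total discrimination) computed over a response matrix of models × items. The data, array names, and setup below are invented for illustration; they are not taken from the paper or from OpenEval.

```python
# Illustrative sketch: classical item statistics from a hypothetical
# item-level benchmark result matrix (models x items, 1 = correct, 0 = incorrect).
# All values are synthetic and for demonstration only.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_items = 8, 12
# Hypothetical response matrix: rows are models, columns are benchmark items.
responses = rng.integers(0, 2, size=(n_models, n_items))

# Item difficulty: proportion of models answering each item correctly.
difficulty = responses.mean(axis=0)

# Item discrimination: point-biserial correlation between an item's scores
# and each model's total score on the remaining items (rest-score).
total = responses.sum(axis=1)
discrimination = np.empty(n_items)
for j in range(n_items):
    item = responses[:, j]
    rest = total - item
    if item.std() == 0 or rest.std() == 0:
        discrimination[j] = np.nan  # item carries no signal across these models
    else:
        discrimination[j] = np.corrcoef(item, rest)[0, 1]

for j in range(n_items):
    print(f"item {j:2d}: difficulty={difficulty[j]:.2f} "
          f"discrimination={discrimination[j]: .2f}")
```

Statistics like these are invisible in a single leaderboard number: an item answered correctly by every model, or one whose scores are uncorrelated with overall ability, contributes nothing to distinguishing systems, which is exactly the kind of validity evidence the authors argue item-level data makes available.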