Position: Science of AI Evaluation Requires Item-level Benchmark Data
arXiv cs.AI / 4/7/2026
💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research
Key Points
- The paper argues that today’s AI evaluation methods often suffer from systemic validity failures, including flawed benchmark design choices and metrics that are poorly aligned with the capabilities they are meant to measure.
- It makes the case that collecting item-level benchmark data, i.e., per-item model responses and scores rather than only aggregate leaderboard numbers, is necessary to build a more rigorous “science of AI evaluation,” enabling fine-grained diagnostics and principled benchmark validation (a minimal sketch follows this list).
- The authors review evaluation paradigms by drawing connections between computer science evaluation practice and psychometrics, showing how item-level evidence can surface validity problems that aggregate scores conceal.
- They introduce OpenEval, a growing repository intended to support community adoption of evidence-centered, item-level AI evaluation workflows.
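
To make the item-level point concrete, here is a minimal sketch in Python. The record fields (`model`, `item_id`, `correct`) and both diagnostics are illustrative assumptions, not the paper’s or OpenEval’s actual schema; the sketch only contrasts an aggregate leaderboard score with a per-item analysis that becomes possible once item-level outcomes are released.

```python
from collections import defaultdict

# Hypothetical item-level records: one entry per (model, item) pair,
# instead of a single aggregate accuracy per model.
records = [
    {"model": "model_a", "item_id": "q1", "correct": True},
    {"model": "model_a", "item_id": "q2", "correct": True},
    {"model": "model_a", "item_id": "q3", "correct": False},
    {"model": "model_a", "item_id": "q4", "correct": False},
    {"model": "model_b", "item_id": "q1", "correct": True},
    {"model": "model_b", "item_id": "q2", "correct": False},
    {"model": "model_b", "item_id": "q3", "correct": True},
    {"model": "model_b", "item_id": "q4", "correct": False},
]


def aggregate_accuracy(recs):
    """Leaderboard-style score: collapses all items into one number per model."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in recs:
        totals[r["model"]] += 1
        hits[r["model"]] += r["correct"]  # bool counts as 0/1
    return {m: hits[m] / totals[m] for m in totals}


def item_pass_rate(recs):
    """Per-item pass rate: a basic psychometric-style diagnostic (item difficulty)
    that requires item-level data to compute."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in recs:
        totals[r["item_id"]] += 1
        hits[r["item_id"]] += r["correct"]
    return {i: hits[i] / totals[i] for i in totals}


if __name__ == "__main__":
    print("aggregate accuracy:", aggregate_accuracy(records))
    print("per-item pass rate:", item_pass_rate(records))
```

In this toy data both models tie at 50% aggregate accuracy, yet the item-level view shows that q1 is solved by every model, q4 by none (a candidate for a saturated or broken item), and the two models disagree on q2 and q3, which is exactly the kind of fine-grained evidence aggregate scores hide.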
