AI benchmarks are broken. Here’s what we need instead.
MIT Technology Review / 3/31/2026
Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The article argues that traditional AI benchmarking is flawed because it frames evaluation as a contest between AI and individual humans on isolated tasks.
- It explains that this "AI vs. human on single problems" framing is seductive but fails to capture the real-world capabilities and constraints that matter for practical deployment.
- The piece calls for alternative evaluation approaches that better reflect how AI systems are actually used, accounting for broader context, robustness, and application-oriented success criteria.
- It suggests that benchmarks should be redesigned or supplemented to measure what matters to end users and deployed systems, rather than narrow head-to-head performance comparisons.
For decades, artificial intelligence has been evaluated through the question of whether machines outperform humans. From chess to advanced math, from coding to essay writing, the performance of AI models and applications is tested against that of individual humans completing tasks. This framing is seductive: An AI vs. human comparison on isolated problems with clear…