BenchScope: How Many Independent Signals Does Your Benchmark Provide?

arXiv cs.AI / 4/1/2026


Key Points

  • The paper introduces Effective Dimensionality (ED), a fast diagnostic based on the participation ratio of a centered benchmark-score spectrum, to estimate how much independent information a benchmark’s reported scores actually contain (see the formula sketch after this list).
  • Applied to 22 benchmarks across 8 domains and more than 8,400 model evaluations, ED reveals substantial redundancy in current evaluation suites: the Open LLM Leaderboard’s six scores behave like roughly two effective axes (ED = 1.7).
  • It reports that BBH and MMLU-Pro are near-interchangeable (correlation rho = 0.96, stable across seven subpopulations), while measurement breadth varies by more than 20x across current benchmarks.
  • The authors show ED rankings are stable under matched-dimension controls and use ED to flag redundant benchmark components, monitor performance-conditional compression, and support ongoing benchmark maintenance.
  • The paper provides a 22-benchmark reference atlas and a four-step workflow for maintainers, and cautions that ED is a screening statistic (not a literal latent factor count) supported by null/reliability/saturation analyses.
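For reference, the participation ratio underlying ED takes the standard form below, where the lambda_i are the eigenvalues of the covariance of the centered score matrix. This is a sketch of the usual definition; the paper’s exact normalization (for example, per-instance binary outcomes versus aggregate scores) may differ.

$$\mathrm{ED} = \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2}, \qquad 1 \le \mathrm{ED} \le d,$$

with d the number of reported scores: ED near 1 means a single dominant axis of variation, while ED near d means every score carries independent signal.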

Abstract

AI evaluation suites often report many scores without checking whether those scores carry independent information. We introduce Effective Dimensionality (ED), the participation ratio of a centered benchmark-score spectrum, as a fast, population-conditional upper-bound diagnostic of measurement breadth. Applied at per-instance granularity to 22 benchmarks across 8 domains and more than 8,400 model evaluations, ED reveals substantial redundancy: the six-score Open LLM Leaderboard behaves like roughly two effective measurement axes (ED = 1.7), BBH and MMLU-Pro are near-interchangeable (rho = 0.96, stable across seven subpopulations), and measurement breadth varies more than 20x across current benchmarks. We show that relative ED rankings are stable under matched-dimension controls and that ED can flag redundant suite components, monitor performance-conditional compression, and guide benchmark maintenance. Because binary spectra overestimate absolute latent dimensionality, we interpret ED as a screening statistic rather than a literal factor count and complement it with null, reliability, and saturation analyses. We provide a 22-benchmark reference atlas and a four-step diagnostic workflow that benchmark maintainers can run with a score matrix and a few lines of code.
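The abstract’s claim that the diagnostic runs “with a score matrix and a few lines of code” is plausible given the definition. Below is a minimal NumPy sketch, not the authors’ released code: the function name effective_dimensionality is hypothetical, and it assumes ED is the participation ratio of the covariance eigenvalues of a models-by-scores matrix.

```python
import numpy as np

def effective_dimensionality(scores: np.ndarray) -> float:
    """Participation ratio of the spectrum of a column-centered score matrix.

    scores: (n_models, n_scores) matrix of reported benchmark scores.
    Returns a value in [1, n_scores]: ~1 means one dominant axis of
    variation; ~n_scores means every score carries independent signal.
    """
    X = scores - scores.mean(axis=0, keepdims=True)  # center each score column
    eig = np.linalg.svd(X, compute_uv=False) ** 2    # covariance eigenvalues, up to a 1/(n-1) factor
    return float(eig.sum() ** 2 / (eig ** 2).sum())  # participation ratio is scale-invariant

# Synthetic check: six score columns driven by two latent abilities should
# collapse toward roughly two effective axes rather than six (the exact value
# depends on how evenly the two latents load), echoing the paper's
# Open LLM Leaderboard finding.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))       # 500 models, 2 latent axes
loadings = rng.normal(size=(2, 6))       # 6 reported scores
scores = latent @ loadings + 0.05 * rng.normal(size=(500, 6))
print(f"ED of six scores from two latents: {effective_dimensionality(scores):.2f}")
```

Because the participation ratio is scale-invariant, the unnormalized squared singular values of the centered matrix give the same answer as the covariance eigenvalues, so no explicit 1/(n-1) normalization is needed.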