BenchBrowser: Collecting Evidence for Evaluating Benchmark Validity
arXiv cs.AI / 3/20/2026
Key Points
- The paper argues that high-level benchmark metadata is too coarse to verify whether benchmarks test the capabilities practitioners actually care about.
- It introduces BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases across more than 20 benchmark suites.
- A human study validates BenchBrowser's retrieval precision, supporting its use in diagnosing content-validity gaps and low convergent validity between benchmarks.
- BenchBrowser provides a way to quantify and diagnose the gap between practitioner intent and what benchmarks actually test.
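The core idea of use-case-driven retrieval can be illustrated with a minimal sketch: given a natural language description of what a practitioner wants to evaluate, rank individual benchmark items by textual similarity and return the closest matches. The item IDs, prompts, and bag-of-words scoring below are illustrative assumptions, not the paper's actual index or retrieval model (which BenchBrowser does not disclose in this summary):

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words term counts over lowercase whitespace tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical benchmark items (id, prompt text); invented for illustration.
ITEMS = [
    ("mmlu-0017", "solve this college chemistry equilibrium problem"),
    ("gsm8k-0421", "a word problem about splitting a restaurant bill"),
    ("humaneval-031", "write a python function that parses dates"),
]

def retrieve(use_case, items, k=2):
    # Rank benchmark items by similarity to a natural language use case,
    # keeping only items with nonzero overlap.
    q = vectorize(use_case)
    scored = [(cosine(q, vectorize(text)), item_id) for item_id, text in items]
    scored.sort(reverse=True)
    return [item_id for score, item_id in scored if score > 0][:k]

print(retrieve("help me write python code to parse dates", ITEMS))
```

A retriever like this makes validity gaps concrete: if a use case returns few or no relevant items across 20+ suites, that is evidence of a content-validity gap for that capability.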