AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance
arXiv cs.AI / 4/15/2026
Key Points
- The paper introduces AISafetyBenchExplorer, a structured catalogue covering 195 AI safety benchmarks published from 2018–2026, with metadata at the benchmark, metric, and repository levels.
- The authors use the catalogue to show that benchmark proliferation has outpaced standardization, leading to fragmentation in how LLM safety is operationalized and judged.
- The landscape is reported as uneven, with many medium-complexity benchmarks but only a small number in a “Popular” tier, alongside strong skew toward English-only evaluation.
- The study finds frequent governance and durability issues, including many stale GitHub repositories and Hugging Face datasets, suggesting weak post-publication stewardship.
- At the metric level, common labels (e.g., accuracy/F1/safety score) often hide substantively different judges, aggregation rules, and threat models, limiting comparability across studies.