Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation
arXiv cs.CL / 4/7/2026
Key Points
- The paper introduces QIMMA, a quality-assured Arabic LLM leaderboard that treats benchmark quality validation as a first-class step instead of adopting existing benchmarks as-is.
- QIMMA uses a multi-model evaluation pipeline that combines automated LLM judgment with human review to identify and fix systematic issues in established Arabic benchmark data (see the sketch after this list).
- The resulting evaluation suite covers multiple domains and tasks with over 52k samples, grounded mainly in native Arabic content (with code tasks treated as language-agnostic).
- QIMMA emphasizes reproducibility through a transparent implementation (built on LightEval and EvalPlus) and by publicly releasing per-sample inference outputs so the community can extend the leaderboard.
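
To make the validation step concrete, here is a minimal Python sketch of a multi-judge quality screen that escalates samples to human review when enough judges flag an issue. Everything here is a hypothetical illustration: the `Sample` schema, `screen_samples`, and the rule-based judges are assumptions rather than QIMMA's actual code, and the paper's real judges are LLMs, which the toy callables below only stand in for.

```python
# Minimal sketch of a multi-judge benchmark-quality screen.
# Hypothetical names and schema -- NOT QIMMA's actual pipeline.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Sample:
    question: str
    answer: str
    flags: list = field(default_factory=list)

# A judge inspects one sample and returns an issue label, or None if it looks fine.
Judge = Callable[[Sample], Optional[str]]

def screen_samples(samples, judges, quorum=2):
    """Split samples into auto-pass and human-review sets.

    A sample is escalated to human review when at least `quorum`
    judges independently flag an issue with it.
    """
    auto_pass, needs_review = [], []
    for s in samples:
        s.flags = [f for judge in judges if (f := judge(s)) is not None]
        (needs_review if len(s.flags) >= quorum else auto_pass).append(s)
    return auto_pass, needs_review

# Toy rule-based stand-ins; the paper combines LLM judgments with human review.
def empty_answer_judge(s: Sample) -> Optional[str]:
    return "empty_answer" if not s.answer.strip() else None

def non_arabic_judge(s: Sample) -> Optional[str]:
    return "non_arabic_question" if s.question.isascii() else None

if __name__ == "__main__":
    samples = [
        Sample("ما عاصمة مصر؟", "القاهرة"),           # clean sample
        Sample("What is the capital of Egypt?", ""),   # two issues -> review
    ]
    ok, review = screen_samples(samples, [empty_answer_judge, non_arabic_judge])
    print(f"{len(ok)} passed automatically, {len(review)} sent to human review")
```

The quorum threshold is one plausible way to trade judge precision against human-review load; the paper's exact aggregation rule may differ.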
