Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation

arXiv cs.CL / 4/7/2026


Key Points

  • The paper introduces QIMMA, a quality-assured Arabic LLM leaderboard that treats benchmark-quality validation as a first-class step rather than adopting existing benchmarks as-is.
  • QIMMA uses a multi-model evaluation pipeline that combines automated LLM judgment with human review to identify and fix systematic issues in established Arabic benchmark data (see the sketch after this list).
  • The resulting evaluation suite covers multiple domains and tasks with over 52k samples, grounded mainly in native Arabic content (with code tasks treated as language-agnostic).
  • QIMMA emphasizes reproducibility through transparent implementation (LightEval, EvalPlus) and by publicly releasing per-sample inference outputs to support community extension.
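
One way to picture that validation pipeline is an ensemble of LLM judges with a quorum rule for escalating suspect samples to human review. The sketch below is a hypothetical illustration under that assumption: the judge names, the `judge` placeholder, and the `triage` quorum are invented here, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Optional

JUDGE_MODELS = ["judge-a", "judge-b", "judge-c"]  # hypothetical ensemble

@dataclass
class Sample:
    question: str
    gold: str
    flags: list = field(default_factory=list)  # issue labels raised by judges

def judge(model: str, sample: Sample) -> Optional[str]:
    """Stand-in for an LLM-as-judge call. A real pipeline would prompt
    `model` to check for wrong gold labels, translationese, ambiguity,
    etc., and return an issue label (or None if the sample looks clean)."""
    if not sample.gold.strip():  # trivial placeholder heuristic
        return "empty-gold"
    return None

def triage(samples: list, quorum: int = 2):
    """Auto-pass samples no judge objects to; escalate to human review
    when at least `quorum` judges independently flag an issue."""
    auto_pass, needs_review = [], []
    for s in samples:
        s.flags = [f for m in JUDGE_MODELS if (f := judge(m, s)) is not None]
        (needs_review if len(s.flags) >= quorum else auto_pass).append(s)
    return auto_pass, needs_review

if __name__ == "__main__":
    batch = [Sample("ما عاصمة مصر؟", "القاهرة"), Sample("٢+٢؟", "")]
    ok, review = triage(batch)
    print(f"{len(ok)} auto-passed; {len(review)} queued for human review")
```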

Abstract

We present QIMMA, a quality-assured Arabic LLM leaderboard that places systematic benchmark validation at its core. Rather than aggregating existing resources as-is, QIMMA applies a multi-model assessment pipeline combining automated LLM judgment with human review to surface and resolve systematic quality issues in well-established Arabic benchmarks before evaluation. The result is a curated, multi-domain, multi-task evaluation suite of over 52k samples, grounded predominantly in native Arabic content; code evaluation tasks are the sole exception, as they are inherently language-agnostic. A transparent implementation via LightEval and EvalPlus, together with the public release of per-sample inference outputs, makes QIMMA a reproducible and community-extensible foundation for Arabic NLP evaluation.
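
Releasing per-sample inference outputs means third parties can re-score results without re-running inference. As a hedged sketch, assuming the outputs ship as JSONL with `task`, `prediction`, and `gold` fields (an invented schema; the actual release format is not described in this summary), recomputing exact-match accuracy per task takes only a few lines:

```python
import json
from collections import defaultdict

def accuracy_by_task(path: str) -> dict:
    """Recompute exact-match accuracy per task from a JSONL dump in which
    each line holds one model prediction next to its gold answer."""
    hits, totals = defaultdict(int), defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            totals[rec["task"]] += 1
            hits[rec["task"]] += rec["prediction"].strip() == rec["gold"].strip()
    return {task: hits[task] / totals[task] for task in totals}

# Hypothetical usage; the file name is illustrative:
# scores = accuracy_by_task("qimma_per_sample_outputs.jsonl")
```

Exact match is only one metric family, of course; for the code tasks an EvalPlus-style pass@k harness would replace string comparison with test execution.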