IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

arXiv cs.CL / 3/26/2026


Key Points

  • The paper introduces IslamicMMLU, a new benchmark with 10,013 multiple-choice questions to evaluate LLMs on Islamic knowledge across Quran, Hadith, and Fiqh.
  • The benchmark is organized into three tracks with multiple question types per track, enabling assessment of different reasoning and knowledge-handling capabilities.
  • Initial evaluation of 26 LLMs shows large performance variance across models, with overall averaged accuracy ranging from 39.8% to 93.8% and the Quran track exhibiting the widest spread.
  • The Fiqh track includes a new madhab (school of jurisprudence) bias detection task that measures differing model preferences across schools of thought (a scoring sketch follows this list).
  • The authors release the evaluation code and a public leaderboard; their initial results show that Arabic-specific models are inconsistent and generally underperform frontier models.
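
The paper summary does not spell out how the madhab bias task is scored, so the following is only a minimal sketch of one plausible scoring scheme: for Fiqh items on which the schools disagree, tally which school's ruling each model selects and inspect the resulting distribution. All field names (`question`, `options`, `madhab_of_option`) are hypothetical, not the paper's released format.

```python
from collections import Counter

# Hypothetical item format for a madhab bias-detection question: each answer
# option is a ruling attributed to one school of jurisprudence.
items = [
    {
        "question": "…",
        "options": ["ruling A", "ruling B", "ruling C", "ruling D"],
        "madhab_of_option": ["Hanafi", "Maliki", "Shafi'i", "Hanbali"],
    },
    # … more disagreement items
]

def madhab_preferences(items, model_answers):
    """Tally which school's ruling the model picked on each disagreement item.

    model_answers: list of chosen option indices, one per item.
    Returns a Counter mapping school name -> number of times selected.
    """
    tally = Counter()
    for item, choice in zip(items, model_answers):
        tally[item["madhab_of_option"][choice]] += 1
    return tally

# A model that consistently leans toward one school would show a skewed tally.
print(madhab_preferences(items, model_answers=[0]))
```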

Abstract

Large language models are increasingly consulted for Islamic knowledge, yet no comprehensive benchmark evaluates their performance across core Islamic disciplines. We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions). Each track comprises multiple question types, probing LLMs' ability to handle different aspects of Islamic knowledge. The benchmark underpins the IslamicMMLU public leaderboard; in an initial evaluation of 26 LLMs, averaged accuracy across the three tracks ranged from 39.8% to 93.8% (achieved by Gemini 3 Flash). The Quran track shows the widest span (32.4% to 99.3%), while the Fiqh track includes a novel madhab (Islamic school of jurisprudence) bias detection task revealing variable school-of-thought preferences across models. Arabic-specific models show mixed results, and all of them underperform frontier models. The evaluation code and leaderboard are made publicly available.
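
The phrase "averaged accuracy across the three tracks" reads as a macro-average of per-track accuracies rather than a single accuracy over all 10,013 items. Below is a minimal sketch of that computation under this assumption; the per-item prediction format is invented for illustration and is not the paper's released evaluation code.

```python
def track_accuracy(gold, predicted):
    """Fraction of multiple-choice items answered correctly within one track."""
    assert len(gold) == len(predicted)
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def leaderboard_score(per_track_results):
    """Macro-average over tracks, as the abstract's phrasing suggests.

    per_track_results: dict mapping track name -> (gold, predicted) lists,
    e.g. {"Quran": (...), "Hadith": (...), "Fiqh": (...)}.
    """
    accs = {t: track_accuracy(g, p) for t, (g, p) in per_track_results.items()}
    return accs, sum(accs.values()) / len(accs)

# Toy usage with made-up A/B/C/D answer labels:
accs, avg = leaderboard_score({
    "Quran": (["A", "C"], ["A", "B"]),
    "Hadith": (["D", "D"], ["D", "D"]),
    "Fiqh": (["B", "A"], ["B", "C"]),
})
print(accs, avg)  # {'Quran': 0.5, 'Hadith': 1.0, 'Fiqh': 0.5} 0.666...
```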