Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
arXiv cs.CL / 3/26/2026
Key Points
- The paper proposes a retrieval-augmented generation (RAG) framework that grounds Arabic LLM outputs in diachronic lexicographic knowledge from the Doha Historical Dictionary of Arabic (DHDA) to better handle complex historical and religious texts like the Quran and Hadith.
- It uses a hybrid retrieval strategy combined with an intent-based routing mechanism to supply LLMs with precise, contextually relevant evidence drawn specifically from DHDA rather than general-purpose corpora.
- Experiments report that accuracy for Arabic-native models such as Fanar and ALLaM rises to over 85%, narrowing the performance gap with the proprietary Gemini model.
- The evaluation uses Gemini as an “LLM-as-a-judge” and validates its verdicts against human evaluation, with high agreement between the two (kappa = 0.87).
- The study identifies recurring linguistic challenges, especially diacritics and compound expressions, and releases code and resources publicly for reproducibility.
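
To make the hybrid retrieval and intent-based routing idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the summary does not say which signals are combined or how intents are detected, so this assumes a keyword router, a token-overlap (sparse) score, and a character-trigram cosine score standing in for a dense embedding. The entries, `route_intent`, `hybrid_retrieve`, and the `alpha` weight are all hypothetical, and the toy entries are not real DHDA data.

```python
import math
from collections import Counter

# Toy stand-in entries (transliterated); NOT real DHDA content.
ENTRIES = [
    {"id": "e1", "kind": "meaning", "text": "kitab book written record"},
    {"id": "e2", "kind": "etymology", "text": "kitab root ktb earliest attested usage"},
    {"id": "e3", "kind": "meaning", "text": "salat prayer ritual worship"},
]

def route_intent(query):
    """Hypothetical keyword router: decide which kind of entry to search."""
    q = query.lower()
    if any(word in q for word in ("origin", "root", "earliest")):
        return "etymology"
    return "meaning"

def lexical_score(query, text):
    """Sparse signal: fraction of query tokens that appear in the entry."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0

def char_ngrams(s, n=3):
    """Character n-gram counts, used here in place of a real embedding."""
    s = s.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def hybrid_retrieve(query, entries=ENTRIES, alpha=0.5, k=1):
    """Route by intent, then rank candidates by a weighted sum of both scores."""
    kind = route_intent(query)
    candidates = [e for e in entries if e["kind"] == kind]
    query_grams = char_ngrams(query)

    def score(entry):
        sparse = lexical_score(query, entry["text"])
        dense_like = cosine(query_grams, char_ngrams(entry["text"]))
        return alpha * sparse + (1 - alpha) * dense_like

    return sorted(candidates, key=score, reverse=True)[:k]

if __name__ == "__main__":
    for question in ("what is the root of kitab", "meaning of salat prayer"):
        print(question, "->", [e["id"] for e in hybrid_retrieve(question)])
```

The retrieved entries would then be placed in the LLM prompt as evidence. In a real system the trigram score would presumably be replaced by an embedding model and the router by a learned or prompted classifier.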
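
The agreement figure between the LLM judge and human raters can be illustrated with a short kappa computation. The summary does not state which kappa variant the authors used; the sketch below assumes Cohen's kappa for two raters, and the label lists are made-up examples rather than the paper's data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

if __name__ == "__main__":
    # Hypothetical verdicts: 1 = answer judged correct, 0 = judged incorrect.
    judge = [1, 1, 0, 0]
    human = [1, 1, 0, 1]
    print(cohen_kappa(judge, human))
```

Values near 1 indicate agreement well beyond chance, which is why a kappa of 0.87 is read as support for using the LLM judge's verdicts.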