Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
arXiv cs.CL / 2026/3/26
Key Points
- The paper proposes a retrieval-augmented generation (RAG) framework that grounds Arabic LLM outputs in diachronic lexicographic knowledge from the Doha Historical Dictionary of Arabic (DHDA), so that models can better handle complex historical and religious texts such as the Quran and Hadith.
- It uses a hybrid retrieval strategy combined with an intent-based routing mechanism to supply LLMs with precise, contextually relevant evidence drawn specifically from DHDA rather than general-purpose corpora.
- Experiments report that the framework raises the accuracy of Arabic-native models such as Fanar and ALLaM to over 85%, narrowing the performance gap with the proprietary Gemini model.
- The evaluation uses an “LLM-as-a-judge” setup with Gemini as the judge and validates the results through human evaluation, which shows high agreement (kappa = 0.87).
- The study identifies recurring linguistic challenges, especially diacritics and compound expressions, and releases its code and resources publicly for reproducibility.
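
The agreement figure in the evaluation bullet can be made concrete with a small sketch. The summary does not say which kappa statistic the paper uses; the code below assumes Cohen's kappa between two raters (the LLM judge and one human annotator). The function name `cohen_kappa` and the toy labels are hypothetical and are not the paper's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance from each rater's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    all_labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in all_labels) / (n * n)

    # Both raters used a single identical label: agreement is trivially perfect.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Toy verdicts (hypothetical): the judge and the human disagree on one item of ten.
judge = ["correct"] * 6 + ["incorrect"] * 4
human = ["correct"] * 5 + ["incorrect"] * 5

print(round(cohen_kappa(judge, human), 2))  # 0.8
```

In this toy case the raters agree on 9 of 10 items (p_o = 0.9) and chance agreement is 0.5, giving kappa = 0.8. A reported value of 0.87 would, under the same assumption, indicate agreement well beyond what the label frequencies alone would produce.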



