Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues

arXiv cs.AI / 5/4/2026


Key Points

  • The article highlights a major evaluation gap: current LLM Arabic benchmarks largely rely on Modern Standard Arabic (MSA) snippets and miss cultural nuances that emerge in real dialogues and dialects.
  • It introduces ArabCulture-Dialogue, a culturally grounded conversational dataset spanning 13 Arabic-speaking countries, with both MSA and local dialects, covering 12 daily-life topics and 54 subtopics.
  • Using this dataset, the authors define three benchmark tasks: multiple-choice cultural reasoning, translation between MSA and dialects, and dialect-steering text generation.
  • The experiments show a consistent performance drop for LLMs in dialectal settings across all three tasks compared with MSA, indicating that models still struggle with dialect- and culture-specific dialogue understanding.
  • The work provides a more realistic framework for measuring LLM capabilities in culturally rich, multilingual Arabic conversational contexts.
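
The first of the three tasks, multiple-choice cultural reasoning, can be scored per language variety to expose the MSA-vs-dialect gap described above. The sketch below is illustrative only: the field names (`question`, `choices`, `answer`, `variety`) and the `ask_model` callable are assumptions, not the paper's actual dataset schema or evaluation code.

```python
def score_mcq(items, ask_model):
    """Return multiple-choice accuracy per language variety.

    items     : iterable of dicts with hypothetical keys
                'question', 'choices', 'answer', 'variety'
                (e.g. variety is 'msa' or a dialect label).
    ask_model : callable (question, choices) -> chosen answer string,
                standing in for any LLM query.
    """
    correct, total = {}, {}
    for item in items:
        variety = item["variety"]
        prediction = ask_model(item["question"], item["choices"])
        total[variety] = total.get(variety, 0) + 1
        if prediction == item["answer"]:
            correct[variety] = correct.get(variety, 0) + 1
    # Accuracy per variety; comparing these values surfaces the
    # MSA-vs-dialect performance gap reported in the paper.
    return {v: correct.get(v, 0) / total[v] for v in total}


# Tiny usage example with a dummy "model" that always picks the
# first choice, just to show the expected call shape.
sample = [
    {"variety": "msa", "question": "q1",
     "choices": ["a", "b"], "answer": "a"},
    {"variety": "dialect", "question": "q2",
     "choices": ["a", "b"], "answer": "b"},
]
accuracy = score_mcq(sample, lambda q, c: c[0])
```

The same per-variety bookkeeping would apply to the translation and dialect-steering tasks, with accuracy swapped for the relevant generation metric.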

Abstract

There is a significant gap in evaluating cultural reasoning in LLMs using conversational datasets that capture culturally rich and dialectal contexts. Most Arabic benchmarks focus on short text snippets in Modern Standard Arabic (MSA), overlooking the cultural nuances that naturally arise in dialogues. To address this gap, we introduce ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and each country's respective dialect, spanning 12 daily-life topics and 54 fine-grained subtopics. We use the dataset to form three benchmarking tasks: (i) multiple-choice cultural reasoning, (ii) machine translation between MSA and dialects, and (iii) dialect-steering generation. Our experiments indicate that the performance gap between MSA and Arabic dialects persists: models perform worse on all three tasks in the dialectal setup than in the MSA one.