Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection

arXiv cs.CL / 4/9/2026


Key Points

  • The paper introduces a context-aware, steerable framework for dialectal Arabic machine translation that explicitly models regional and sociolinguistic variation rather than defaulting to Modern Standard Arabic (MSA).
  • It contributes a Rule-Based Data Augmentation (RBDA) pipeline that expands a 3,000-sentence seed corpus into a balanced 57,000-sentence parallel dataset covering eight dialect regions (e.g., Egyptian, Levantine, Gulf).
  • The approach fine-tunes an mT5-base model conditioned on lightweight metadata tags to enable controllable translation across dialects and social registers.
  • Results show an accuracy–fidelity trade-off: strong baselines like NLLB achieve higher BLEU by defaulting toward MSA, at the cost of reduced dialect specificity, while the proposed model produces more dialect-aligned outputs with lower BLEU.
  • The authors argue that standard MT metrics may not reflect dialect-sensitive quality well and propose LLM-assisted cultural authenticity evaluation as supporting evidence of improved dialect alignment.
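The metadata-tag conditioning mentioned above can be illustrated with a minimal sketch. The tag names, dialect inventory, and input format below are assumptions for illustration, not the paper's actual scheme: the idea is simply that prepending lightweight control tokens to the source lets a fine-tuned seq2seq model (here, mT5-base) be steered toward a target dialect and register at inference time.

```python
# Hypothetical sketch: prepend lightweight metadata tags to the source
# sentence so a seq2seq model (e.g., mT5) can condition its output on the
# requested dialect and social register. Tag strings are illustrative.

DIALECT_TAGS = {"egyptian": "<EGY>", "levantine": "<LEV>", "gulf": "<GLF>"}
REGISTER_TAGS = {"formal": "<FORMAL>", "informal": "<INFORMAL>"}

def tag_source(text: str, dialect: str, register: str) -> str:
    """Build a tag-conditioned model input from raw source text."""
    return f"{DIALECT_TAGS[dialect]} {REGISTER_TAGS[register]} {text}"

# Changing only the tags steers the output variety without retraining:
print(tag_source("How are you today?", "egyptian", "informal"))
# <EGY> <INFORMAL> How are you today?
```

At fine-tuning time, each parallel pair would carry the tags matching its reference translation; at inference, the user's region and register selection picks the tags.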

Abstract

Current Machine Translation (MT) systems for Arabic often struggle to account for dialectal diversity, frequently homogenizing dialectal inputs into Modern Standard Arabic (MSA) and offering limited user control over the target vernacular. In this work, we propose a context-aware and steerable framework for dialectal Arabic MT that explicitly models regional and sociolinguistic variation. Our primary technical contribution is a Rule-Based Data Augmentation (RBDA) pipeline that expands a 3,000-sentence seed corpus into a balanced 57,000-sentence parallel dataset covering eight regional varieties (e.g., Egyptian, Levantine, Gulf). By fine-tuning an mT5-base model conditioned on lightweight metadata tags, our approach enables controllable generation across dialects and social registers in the translation output. Through a combination of automatic evaluation and qualitative analysis, we observe an apparent accuracy–fidelity trade-off: high-resource baselines such as NLLB (No Language Left Behind) achieve higher aggregate BLEU scores (13.75) by defaulting toward the MSA mean, while exhibiting limited dialectal specificity. In contrast, our model achieves lower BLEU scores (8.19) but produces outputs that align more closely with the intended regional varieties. Supporting qualitative evaluation, including an LLM-assisted cultural authenticity analysis, suggests improved dialectal alignment compared to baseline systems (4.80/5 vs. 1.0/5). These findings highlight the limitations of standard MT metrics for dialect-sensitive tasks and motivate the need for evaluation practices that better reflect linguistic diversity in Arabic MT.
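To make the RBDA idea concrete, here is a toy sketch of one augmentation step. The substitution rules below are invented English stand-ins (the paper's actual rules operate on Arabic and are not given here): each seed sentence is rewritten under every dialect's rule set, so a small seed corpus fans out into a much larger, dialect-balanced one.

```python
# Hypothetical sketch of rule-based data augmentation: per-dialect
# substitution rules rewrite a seed sentence into several dialectal
# variants, multiplying the corpus size. Rules here are toy examples.

RULES = {
    "egyptian":  [("what", "eh"), ("now", "dilwa'ti")],
    "levantine": [("what", "shu"), ("now", "halla'")],
}

def augment(seed: str) -> dict:
    """Apply each dialect's substitution rules to one seed sentence."""
    variants = {}
    for dialect, rules in RULES.items():
        out = seed
        for src, tgt in rules:
            out = out.replace(src, tgt)
        variants[dialect] = out
    return variants

print(augment("what happens now"))
# {'egyptian': "eh happens dilwa'ti", 'levantine': "shu happens halla'"}
```

Scaled up with more rules, dialects, and register variants, this kind of deterministic rewriting is one plausible way a 3,000-sentence seed corpus could be expanded to 57,000 balanced parallel pairs.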