Benchmarking Bengali Dialectal Bias: A Multi-Stage Framework Integrating RAG-Based Translation and Human-Augmented RLAIF

arXiv cs.CL / 3/24/2026

Key Points

  • The paper introduces a multi-stage benchmarking framework to quantify LLM performance bias across nine Bengali dialects, addressing a lack of prior measurement approaches for low-resource regional varieties.
  • It uses a RAG-based translation pipeline to create 4,000 dialectal question sets (a minimal sketch of such a pipeline follows this list), and validates translation fidelity with an LLM-as-a-judge method that human assessments show is more reliable than legacy translation metrics.
  • The study benchmarks 19 LLMs on these gold-labeled dialectal QA test sets, running 68,395 RLAIF-style evaluations validated through multi-judge agreement with human fallback.
  • Results show large, dialect-linked performance drops (e.g., responses in the Chittagong dialect score 5.44/10 vs. 7.68/10 for Tangail), and scaling up models does not consistently reduce the bias.
  • The work contributes a validated translation-quality evaluation method, a benchmark dataset, and a Critical Bias Sensitivity (CBS) metric aimed at safety-critical application needs.
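
The paper summary does not include pipeline code, but the retrieval step it describes is straightforward to picture. Below is a minimal Python sketch of a RAG-based dialect translation step under stated assumptions: a small corpus of standard-to-dialect exemplar pairs, an arbitrary similarity function standing in for a dense retriever, and a generic `llm` callable. All names here (`Exemplar`, `retrieve_exemplars`, `build_translation_prompt`, `translate_to_dialect`) are hypothetical, not the authors' API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Exemplar:
    standard: str  # question in standard Bengali
    dialect: str   # attested rendering in the target dialect

def retrieve_exemplars(query: str, corpus: List[Exemplar],
                       similarity: Callable[[str, str], float],
                       k: int = 3) -> List[Exemplar]:
    """Pick the k exemplars whose standard side best matches the query.
    A production retriever would use dense embeddings; any scorer works here."""
    ranked = sorted(corpus, key=lambda ex: similarity(query, ex.standard),
                    reverse=True)
    return ranked[:k]

def build_translation_prompt(question: str, dialect: str,
                             exemplars: List[Exemplar]) -> str:
    """Assemble a few-shot prompt that grounds the LLM in retrieved pairs."""
    shots = "\n\n".join(f"Standard: {ex.standard}\n{dialect}: {ex.dialect}"
                        for ex in exemplars)
    return (f"Translate the question into the {dialect} dialect of Bengali, "
            f"following the style of the examples.\n\n{shots}\n\n"
            f"Standard: {question}\n{dialect}:")

def translate_to_dialect(question: str, dialect: str, corpus: List[Exemplar],
                         llm: Callable[[str], str],
                         similarity: Callable[[str, str], float]) -> str:
    """Retrieve exemplars, build the prompt, and ask the model to translate."""
    exemplars = retrieve_exemplars(question, dialect and question and corpus,
                                   similarity) if False else \
                retrieve_exemplars(question, corpus, similarity)
    return llm(build_translation_prompt(question, dialect, exemplars)).strip()
```

A real pipeline would swap `similarity` for embedding-based nearest-neighbor search and pass the prompt to an actual chat model; the retrieve-then-prompt structure stays the same.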

Abstract

Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages, yet frameworks to quantify these disparities remain scarce. We propose a two-phase framework to evaluate dialectal bias in LLM question-answering across nine Bengali dialects. First, we translate standard Bengali questions into dialectal variants and gold-label them, using a retrieval-augmented generation (RAG) pipeline to prepare 4,000 question sets. Because traditional translation-quality metrics fail on unstandardized dialects, we evaluate fidelity with an LLM-as-a-judge approach, which correlation with human ratings confirms outperforms legacy metrics. Second, we benchmark 19 LLMs across these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic divergence: responses to the highly divergent Chittagong dialect score 5.44/10, compared with 7.68/10 for Tangail. Furthermore, increased model scale does not consistently mitigate this bias. We contribute a validated translation-quality evaluation method, a rigorous benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.
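
The abstract names the evaluation protocol (RLAIF-style scoring, multi-judge agreement, human fallback) but not its mechanics. The sketch below is one plausible reading under explicit assumptions: each judge returns a 0–10 score, agreement is a simple spread threshold, disagreements are routed to a human rater, and disparity is summarized as a best-minus-worst gap across dialects. Every function name and the tolerance default are hypothetical, and the gap helper is illustrative only, not the paper's CBS metric, whose definition is not given in this summary.

```python
import statistics
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

def judge_with_fallback(response: str,
                        judges: List[Callable[[str], float]],
                        human: Callable[[str], float],
                        tolerance: float = 1.0) -> float:
    """Score a response on a 0-10 scale with multiple LLM judges.
    If the judges agree (spread <= tolerance), accept their median;
    otherwise route the item to a human rater.  The tolerance value
    and median aggregation are assumptions, not the paper's settings."""
    scores = [judge(response) for judge in judges]
    if max(scores) - min(scores) <= tolerance:
        return statistics.median(scores)
    return human(response)

def dialect_means(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average accepted scores per dialect from (dialect, score) pairs."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for dialect, score in records:
        buckets[dialect].append(score)
    return {d: statistics.fmean(s) for d, s in buckets.items()}

def max_dialect_gap(means: Dict[str, float]) -> float:
    """Illustrative disparity summary: best-minus-worst dialect mean.
    This is NOT the paper's Critical Bias Sensitivity (CBS) metric."""
    return max(means.values()) - min(means.values())

# With the means reported in the abstract, the spread would be
# 7.68 (Tangail) - 5.44 (Chittagong) = 2.24 points on the 10-point scale.
```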