DIA-HARM: Dialectal Disparities in Harmful Content Detection Across 50 English Dialects

arXiv cs.CL / 2026-04-08


Key Points

  • The paper introduces DIA-HARM, a benchmark to evaluate disinformation/harmful-content detectors across 50 English dialects rather than only Standard American English.
  • It releases the D3 corpus (195K samples) built via linguistically grounded transformations from established disinformation benchmarks, enabling dialect-robust testing.
  • Testing 16 detection models finds systematic weaknesses: human-written dialectal content lowers F1 by 1.4–3.6%, while AI-generated content stays comparatively stable.
  • Fine-tuned transformers outperform zero-shot LLM approaches (best-case F1 96.6% vs. 78.3%), and some models suffer catastrophic degradation (>33%) especially on mixed content.
  • Cross-dialect transfer results show that multilingual models generalize well (e.g., mDeBERTa averages 97.2% F1), whereas monolingual models (RoBERTa, XLM-RoBERTa) degrade sharply on dialectal inputs, suggesting a systematic disadvantage for non-SAE speakers.
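The robustness comparison in the bullets above boils down to measuring how far a detector's F1 falls when the same items are rewritten into a dialect. A minimal sketch of that check, using a hand-rolled binary F1 and made-up predictions (all labels below are illustrative, not from the paper):

```python
# Sketch of a per-dialect robustness check: compare a detector's F1 on
# Standard American English (SAE) inputs with its F1 on the same items
# rewritten into a dialect. Labels and predictions are toy data.

def f1_score(gold, pred, positive=1):
    """Binary F1 for the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def degradation(gold, pred_sae, pred_dialect):
    """Absolute F1 drop when moving from SAE to dialectal input."""
    return f1_score(gold, pred_sae) - f1_score(gold, pred_dialect)

# Toy example: the detector flips two items once the text is dialectal.
gold     = [1, 1, 1, 1, 0, 0, 0, 0]
pred_sae = [1, 1, 1, 1, 0, 0, 0, 0]  # perfect on SAE -> F1 = 1.0
pred_dia = [1, 1, 0, 1, 0, 1, 0, 0]  # one miss, one false alarm on dialect

print(f"F1 drop: {degradation(gold, pred_sae, pred_dia):.3f}")  # F1 drop: 0.250
```

The paper's reported 1.4–3.6% degradations are this same quantity aggregated over the D3 corpus and 16 models.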

Abstract

Harmful content detectors, particularly disinformation classifiers, are predominantly developed and evaluated on Standard American English (SAE), leaving their robustness to dialectal variation unexplored. We present DIA-HARM, the first benchmark for evaluating disinformation detection robustness across 50 English dialects spanning U.S., British, African, Caribbean, and Asia-Pacific varieties. Using Multi-VALUE's linguistically grounded transformations, we introduce D3 (Dialectal Disinformation Detection), a corpus of 195K samples derived from established disinformation benchmarks. Our evaluation of 16 detection models reveals systematic vulnerabilities: human-written dialectal content degrades detection by 1.4-3.6% F1, while AI-generated content remains stable. Fine-tuned transformers substantially outperform zero-shot LLMs (96.6% vs. 78.3% best-case F1), with some models exhibiting catastrophic failures exceeding 33% degradation on mixed content. Cross-dialectal transfer analysis across 2,450 dialect pairs shows that multilingual models (mDeBERTa: 97.2% average F1) generalize effectively, while monolingual models like RoBERTa and XLM-RoBERTa fail on dialectal inputs. These findings demonstrate that current disinformation detectors may systematically disadvantage hundreds of millions of non-SAE speakers worldwide. We release the DIA-HARM framework, D3 corpus, and evaluation tools: https://github.com/jsl5710/dia-harm
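The figure of 2,450 dialect pairs in the abstract is consistent with all ordered (train, test) pairs of distinct dialects drawn from the 50 varieties. A quick sketch of that enumeration (dialect names here are placeholders, not the benchmark's actual labels):

```python
from itertools import permutations

# With 50 dialects, every ordered (train_dialect, test_dialect) pair of
# distinct dialects gives 50 * 49 = 2,450 transfer cells to evaluate.
dialects = [f"dialect_{i:02d}" for i in range(50)]
pairs = list(permutations(dialects, 2))

print(len(pairs))          # 2450
print(pairs[0])            # ('dialect_00', 'dialect_01')
```

A model's average cross-dialect F1 (e.g., mDeBERTa's reported 97.2%) would then be the mean over these transfer cells.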