Toward Culturally Grounded Natural Language Processing

arXiv cs.CL / 3/30/2026


Key Points

  • The paper argues that progress in multilingual NLP does not automatically imply cultural competence: multilingual capability and cultural understanding can diverge.
  • It synthesizes 50+ papers (2020–2026) showing that performance inequality across languages is driven not only by training data coverage but also by factors like tokenization, prompt language, translated benchmark design, culturally specific supervision, and multimodal context.
  • It highlights multiple benchmark and dataset efforts and critiques (e.g., Global-MMLU, CDEval, WorldValuesBench, CulturalBench, CULEMO, CulturalVQA) that demonstrate strong models can still flatten local norms or misread culturally grounded cues.
  • The authors call for moving beyond treating languages as separate benchmark rows toward modeling “communicative ecologies,” including institutions, scripts, translation pipelines, domains, modalities, and communities.
  • The paper proposes a culturally grounded NLP research agenda emphasizing richer contextual metadata, culturally stratified evaluation, participatory alignment, within-language variation, and multimodal, community-aware design.

Abstract

Recent progress in multilingual NLP is often taken as evidence of broader global inclusivity, but a growing literature shows that multilingual capability and cultural competence come apart. This paper synthesizes over 50 papers from 2020–2026 spanning multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal local-knowledge modeling, benchmark design critiques, and community-grounded data practices. Across this literature, training data coverage remains a strong determinant of performance, yet it is not sufficient: tokenization, prompt language, translated benchmark design, culturally specific supervision, and multimodal context all materially affect outcomes. Recent work on Global-MMLU, CDEval, WorldValuesBench, CulturalBench, CULEMO, CulturalVQA, GIMMICK, DRISHTIKON, WorldCuisines, CARE, CLCA, and newer critiques of benchmark design and community-grounded evaluation shows that strong multilingual models can still flatten local norms, misread culturally grounded cues, and underperform in lower-resource or community-specific settings. We argue that the field should move from treating languages as isolated rows in a benchmark spreadsheet toward modeling communicative ecologies: the institutions, scripts, translation pipelines, domains, modalities, and communities through which language is used. On that basis, we propose a research agenda for culturally grounded NLP centered on richer contextual metadata, culturally stratified evaluation, participatory alignment, within-language variation, and multimodal community-aware design.
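To illustrate the difference between reporting one score per language and culturally stratified evaluation, here is a minimal Python sketch. It assumes each benchmark item carries a hypothetical `region` metadata field alongside its language. The function `accuracy_by` and the toy records are illustrative inventions, not the paper's method, and the numbers are made up rather than reported results.

```python
from collections import defaultdict

def accuracy_by(records, keys):
    """Return accuracy for each combination of the given metadata keys."""
    counts = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_total]
    for rec in records:
        group = tuple(rec[k] for k in keys)
        counts[group][0] += int(rec["correct"])
        counts[group][1] += 1
    return {group: correct / total for group, (correct, total) in counts.items()}

# Toy, made-up records for illustration only (not data from the paper).
# "region" is a hypothetical contextual-metadata field.
records = (
    [{"lang": "es", "region": "ES", "correct": True}] * 4
    + [{"lang": "es", "region": "MX", "correct": True}] * 2
    + [{"lang": "es", "region": "MX", "correct": False}] * 2
)

print(accuracy_by(records, ["lang"]))            # {('es',): 0.75}
print(accuracy_by(records, ["lang", "region"]))  # {('es', 'ES'): 1.0, ('es', 'MX'): 0.5}
```

The single Spanish "row" reports 0.75. Splitting by the assumed region field shows that the aggregate hides a gap between the two groups, which is the kind of within-language variation a per-language spreadsheet cannot surface.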