Exploring Cultural Variations in Moral Judgments with Large Language Models

arXiv cs.AI / 3/31/2026

Key Points

  • The study tests whether large language models reflect culturally diverse moral judgments reported by the World Values Survey (WVS) and Pew’s Global Attitudes Survey (PEW).
  • Researchers compute log-probability-based “moral justifiability” scores and correlate model outputs with survey results across many ethical topics, comparing both smaller monolingual/multilingual models and newer instruction-tuned models (a hedged sketch of one such scoring setup follows this list).
  • Earlier or smaller models often show near-zero or negative correlation with human moral judgments, while advanced instruction-tuned models show substantially higher positive correlations.
  • The analysis finds stronger alignment with W.E.I.R.D. (Western, Educated, Industrialized, Rich, Democratic) nations than with other regions, indicating uneven cross-cultural sensitivity.
  • The paper discusses remaining challenges for specific topics and regions and relates the findings to bias, training-data diversity, and implications for information retrieval and improving cultural sensitivity.
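
The paper's exact prompts and scoring code are not reproduced here, but a minimal sketch of one plausible log-probability setup is below: it contrasts the likelihood a causal language model assigns to a statement calling a topic justifiable versus one calling it unjustifiable. The prompt template, the helper names, and the choice of GPT-2 (one of the smaller models in the study) are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a log-probability-based "moral justifiability" score.
# Assumptions (not from the paper): the contrastive prompt template, the
# helper names, and the choice of GPT-2 as the scored model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability the model assigns to `text` (first token excluded)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # `out.loss` is the mean negative log-likelihood over the predicted
    # tokens; multiply back by their count to recover the summed log-prob.
    return -out.loss.item() * (ids.shape[1] - 1)

def justifiability_score(topic: str, country: str) -> float:
    """Positive => the model prefers the 'justifiable' phrasing (assumed template)."""
    pos = sentence_logprob(f"In {country}, {topic} is justifiable.")
    neg = sentence_logprob(f"In {country}, {topic} is not justifiable.")
    return pos - neg

print(justifiability_score("divorce", "the United States"))
```

Because the two templates differ in length by one token, a per-token average (dividing each summed log-prob by its token count) is a common normalization; the sketch keeps raw sums for brevity.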

Abstract

Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs mirror variations in moral attitudes reported by the World Values Survey (WVS) and the Pew Research Center's Global Attitudes Survey (PEW). We compare smaller monolingual and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct). Using log-probability-based “moral justifiability” scores, we correlate each model's outputs with survey data covering a broad set of ethical topics. Our results show that many earlier or smaller models often produce near-zero or negative correlations with human judgments. In contrast, advanced instruction-tuned models achieve substantially higher positive correlations, suggesting they better reflect real-world moral attitudes. We provide a detailed regional analysis revealing that models align better with Western, Educated, Industrialized, Rich, and Democratic (W.E.I.R.D.) nations than with other regions. While scaling model size and using instruction tuning improves alignment with cross-cultural moral norms, challenges remain for certain topics and regions. We discuss these findings in relation to bias analysis, training data diversity, information retrieval implications, and strategies for improving the cultural sensitivity of LLMs.
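
The correlation step described in the abstract can be reproduced in miniature as a Pearson correlation between aligned vectors of model scores and survey means. The sketch below assumes one score per (country, topic) pair and uses placeholder numbers; it does not reproduce the paper's data pipeline or its actual results.

```python
# A hedged sketch of the correlation step: Pearson r between per-(country,
# topic) model scores and the matching survey means. The numbers below are
# placeholders, not data from the paper, WVS, or PEW.
import numpy as np
from scipy.stats import pearsonr

model_scores = np.array([0.8, -1.2, 0.3, 2.1, -0.5])  # e.g., log-prob differences
survey_means = np.array([6.5, 2.1, 4.0, 8.3, 3.2])    # e.g., 1-10 WVS justifiability scale

r, p = pearsonr(model_scores, survey_means)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```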