Exploring Cultural Variations in Moral Judgments with Large Language Models

arXiv cs.AI / 3/31/2026

Key Points

  • The study tests whether large language models reflect culturally diverse moral judgments reported by the World Values Survey (WVS) and Pew’s Global Attitudes Survey (PEW).
  • Researchers compute log-probability-based “moral justifiability” scores and correlate model outputs with survey results across many ethical topics, comparing both smaller monolingual/multilingual models and newer instruction-tuned models (a hedged sketch of one such scoring setup follows this list).
  • Earlier or smaller models often show near-zero or negative correlation with human moral judgments, while advanced instruction-tuned models show substantially higher positive correlations.
  • The analysis finds stronger alignment with W.E.I.R.D. (Western, Educated, Industrialized, Rich, Democratic) nations than with other regions, indicating uneven cross-cultural sensitivity.
  • The paper discusses remaining challenges for specific topics and regions and relates the findings to bias, training-data diversity, and implications for information retrieval and improving cultural sensitivity.
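
The paper's exact prompts and scoring code are not reproduced here, but a minimal sketch of one plausible log-probability setup is below: it contrasts the likelihood a causal language model assigns to a statement calling a topic justifiable versus one calling it unjustifiable. The prompt template, the helper names, and the choice of GPT-2 (one of the smaller models in the study) are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a log-probability-based "moral justifiability" score.
# Assumptions (not from the paper): the contrastive prompt template, the
# helper names, and the choice of GPT-2 as the scored model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability the model assigns to `text` (first token excluded)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # `out.loss` is the mean negative log-likelihood over the predicted
    # tokens; multiply back by their count to recover the summed log-prob.
    return -out.loss.item() * (ids.shape[1] - 1)

def justifiability_score(topic: str, country: str) -> float:
    """Positive => the model prefers the 'justifiable' phrasing (assumed template)."""
    pos = sentence_logprob(f"In {country}, {topic} is justifiable.")
    neg = sentence_logprob(f"In {country}, {topic} is not justifiable.")
    return pos - neg

print(justifiability_score("divorce", "the United States"))
```

Because the two templates differ in length by one token, a per-token average (dividing each summed log-prob by its token count) is a common normalization; the sketch keeps raw sums for brevity.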

Abstract

Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs mirror variations in moral attitudes reported by the World Values Survey (WVS) and the Pew Research Center's Global Attitudes Survey (PEW). We compare smaller monolingual and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct). Using log-probability-based “moral justifiability” scores, we correlate each model's outputs with survey data covering a broad set of ethical topics. Our results show that many earlier or smaller models often produce near-zero or negative correlations with human judgments. In contrast, advanced instruction-tuned models achieve substantially higher positive correlations, suggesting they better reflect real-world moral attitudes. We provide a detailed regional analysis revealing that models align better with Western, Educated, Industrialized, Rich, and Democratic (W.E.I.R.D.) nations than with other regions. While scaling model size and using instruction tuning improves alignment with cross-cultural moral norms, challenges remain for certain topics and regions. We discuss these findings in relation to bias analysis, training data diversity, information retrieval implications, and strategies for improving the cultural sensitivity of LLMs.
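
The correlation step described in the abstract can be reproduced in miniature as a Pearson correlation between aligned vectors of model scores and survey means. The sketch below assumes one score per (country, topic) pair and uses placeholder numbers; it does not reproduce the paper's data pipeline or its actual results.

```python
# A hedged sketch of the correlation step: Pearson r between per-(country,
# topic) model scores and the matching survey means. The numbers below are
# placeholders, not data from the paper, WVS, or PEW.
import numpy as np
from scipy.stats import pearsonr

model_scores = np.array([0.8, -1.2, 0.3, 2.1, -0.5])  # e.g., log-prob differences
survey_means = np.array([6.5, 2.1, 4.0, 8.3, 3.2])    # e.g., 1-10 WVS justifiability scale

r, p = pearsonr(model_scores, survey_means)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```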