Lessons Without Borders? Evaluating Cultural Alignment of LLMs Using Multilingual Story Moral Generation

arXiv cs.AI / 4/13/2026


Key Points

  • The paper proposes a new evaluation task, “multilingual story moral generation,” to measure how well LLMs align with culturally grounded human interpretations of story morals across language-culture pairs.
  • Using a newly created dataset of human-written story morals spanning 14 language-culture pairs, the authors evaluate model outputs against human responses with semantic similarity, human preference surveys, and value categorization.
  • Results show that frontier models like GPT-4o and Gemini produce morals that are semantically similar to human answers and are generally preferred by evaluators.
  • However, the models display reduced cross-linguistic variation, producing morals that cluster around a narrower set of widely shared values rather than the broader diversity found in human narrative understanding.
  • The work frames cultural alignment as an evaluative, narrative-interpretation problem, offering an alternative to static benchmarks or purely knowledge-based tests.
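
To make the semantic-similarity comparison concrete, here is a minimal sketch of scoring a model-generated moral against human-written reference morals. The paper's actual pipeline is not specified here; this stand-in uses bag-of-words cosine similarity, whereas a faithful reproduction would use multilingual sentence embeddings. All example strings are hypothetical.

```python
from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words token counts.

    A simplified stand-in for the multilingual sentence embeddings a real
    evaluation would use; whitespace tokenization only works for languages
    with space-delimited words.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Hypothetical example: one model moral scored against human references.
model_moral = "kindness to strangers is always rewarded"
human_morals = [
    "kindness to strangers brings its own reward",
    "greed leads to ruin",
]
scores = [cosine_similarity(model_moral, h) for h in human_morals]
best_match = max(scores)  # similarity to the closest human interpretation
```

Aggregating `best_match` over many stories per language-culture pair would give a rough alignment score; the cross-linguistic *variance* of model outputs, not just their mean similarity, is what the paper's diversity finding concerns.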

Abstract

Stories are key to transmitting values across cultures, but their interpretation varies across linguistic and cultural contexts. Thus, we introduce multilingual story moral generation as a novel culturally grounded evaluation task. Using a new dataset of human-written story morals collected across 14 language-culture pairs, we compare model outputs with human interpretations via semantic similarity, a human preference survey, and value categorization. We show that frontier models such as GPT-4o and Gemini generate story morals that are semantically similar to human responses and preferred by human evaluators. However, their outputs exhibit markedly less cross-linguistic variation and concentrate on a narrower set of widely shared values. These findings suggest that while contemporary models can approximate central tendencies of human moral interpretation, they struggle to reproduce the diversity that characterizes human narrative understanding. By framing narrative interpretation as an evaluative task, this work introduces a new approach to studying cultural alignment in language models beyond static benchmarks or knowledge-based tests.