Cultural Authenticity: Comparing LLM Cultural Representations to Native Human Expectations

arXiv cs.CL / 4/7/2026


Key Points

  • The paper argues that evaluating LLM cultural outputs using only diversity and factual accuracy is insufficient, and proposes measuring cultural alignment with how native populations prioritize cultural facets.
  • It introduces a human-centered evaluation framework using “Cultural Importance Vectors” derived from open-ended surveys across nine countries to create a ground-truth baseline of what matters culturally.
  • It further defines “Cultural Representation Vectors” computed from model outputs generated via a syntactically diversified prompt set, and tests three frontier LLMs (Gemini 2.5 Pro, GPT-4o, and Claude 3.5 Haiku).
  • Results suggest a Western-centric calibration for some models, with alignment declining as a country’s cultural distance from the US increases.
  • The study also finds consistent systemic error patterns across models, indicating that outputs may overemphasize certain cultural markers while missing deeper social and value-based priorities.

Abstract

Cultural representation in Large Language Model (LLM) outputs has primarily been evaluated through the proxies of cultural diversity and factual accuracy. However, a crucial gap remains in assessing cultural alignment: the degree to which generated content mirrors how native populations perceive and prioritize their own cultural facets. In this paper, we introduce a human-centered framework to evaluate the alignment of LLM generations with local expectations. First, we establish a human-derived ground-truth baseline of importance vectors, called Cultural Importance Vectors, based on an induced set of culturally significant facets from open-ended survey responses collected across nine countries. Next, we introduce a method to compute model-derived Cultural Representation Vectors of an LLM based on a syntactically diversified prompt set and apply it to three frontier LLMs (Gemini 2.5 Pro, GPT-4o, and Claude 3.5 Haiku). Our investigation of the alignment between the human-derived Cultural Importance and model-derived Cultural Representations reveals a Western-centric calibration for some of the models, where alignment decreases as a country's cultural distance from the US increases. Furthermore, we identify highly correlated, systemic error signatures (ρ > 0.97) across all models, which over-index on some cultural markers while neglecting the deep-seated social and value-based priorities of users. Our approach moves beyond simple diversity metrics toward evaluating the fidelity of AI-generated content in authentically capturing the nuanced hierarchies of global cultures.