Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America

arXiv cs.AI / 3/12/2026

Opinion · Ideas & Deep Analysis · Models & Research

Key Points

The paper proposes leveraging Wikipedia content, the Wikidata knowledge graph, and social science expertise to create a dataset of culturally informed Q/A pairs for Latin American contexts.
The authors construct LatamQA, a dataset of over 26,000 question/answer pairs drawn from 26,000 Wikipedia articles, converted into multiple-choice questions in Spanish and Portuguese and then translated into English.
They use LatamQA to benchmark several LLMs, finding performance disparities across Latin American countries, better performance in the models' original language, and better knowledge of Iberian Spanish culture than of Latin American culture.
The work highlights the lack of bias-detection resources for non-English Latin American contexts and provides a resource for measuring sociocultural bias in LLMs.
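
To make the benchmarking idea concrete, here is a minimal sketch of how multiple-choice results could be broken down by country or by language. It is not the authors' evaluation code: the record fields (`country`, `language`, `predicted`, `answer`), the helper `accuracy_by_group`, and the toy results are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return accuracy per group, where `key` names the grouping field.

    Each record is a dict holding the model's chosen option ('predicted'),
    the gold option ('answer'), and the grouping field (hypothetical schema).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        if record["predicted"] == record["answer"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy results invented for this example, not taken from the paper.
results = [
    {"country": "Mexico", "language": "es", "predicted": "A", "answer": "A"},
    {"country": "Mexico", "language": "en", "predicted": "B", "answer": "A"},
    {"country": "Brazil", "language": "pt", "predicted": "C", "answer": "C"},
    {"country": "Brazil", "language": "en", "predicted": "C", "answer": "C"},
]

by_country = accuracy_by_group(results, "country")
by_language = accuracy_by_group(results, "language")
print(by_country)
print(by_language)
```

Grouping the same records by different fields is what lets a single benchmark run show both cross-country gaps and the effect of the question language.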

Abstract

Large Language Models (LLMs) exhibit inequalities with respect to various cultural contexts. Most prominent open-weights models are trained on Global North data and show prejudicial behavior towards other cultures. Moreover, there is a notable lack of resources to detect biases in non-English languages, especially for Latin America (Latam), a region whose many cultures are distinct even though they share common cultural ground. We propose to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science to create a dataset of question/answer (Q/A) pairs based on the popular and social cultures of various Latin American countries. We create the LatamQA database of over 26k questions and associated answers extracted from 26k Wikipedia articles and transformed into multiple-choice questions (MCQs) in Spanish and Portuguese, which are in turn translated into English. We use these MCQs to quantify the degree of knowledge of various LLMs and find (i) a discrepancy in performance between Latam countries, with some being easier than others for the majority of models, (ii) that models perform better in their original language, and (iii) that Iberian Spanish culture is better known than Latam culture.
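
The abstract does not say how the Q/A pairs are turned into multiple-choice questions, so the following is only a sketch of one plausible approach: distractors are sampled from the answers of other items in the same category. The function `build_mcq`, the item fields (`question`, `answer`, `category`), and the toy items are hypothetical and are not part of LatamQA.

```python
import random
import string


def build_mcq(item, pool, n_distractors=3, seed=0):
    """Turn one Q/A item into a multiple-choice question.

    Distractors are answers of other items in the same category
    (an assumed strategy, not the one described in the paper).
    """
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    candidates = sorted(
        {
            other["answer"]
            for other in pool
            if other["category"] == item["category"]
            and other["answer"] != item["answer"]
        }
    )
    if len(candidates) < n_distractors:
        raise ValueError("not enough same-category answers to use as distractors")

    choices = rng.sample(candidates, n_distractors) + [item["answer"]]
    rng.shuffle(choices)

    letters = string.ascii_uppercase[: len(choices)]
    options = dict(zip(letters, choices))
    correct = next(letter for letter, text in options.items() if text == item["answer"])
    return {"question": item["question"], "options": options, "correct": correct}


# Toy items written for this example only.
qa_pairs = [
    {"question": "What is the capital of Peru?", "answer": "Lima", "category": "capital"},
    {"question": "What is the capital of Chile?", "answer": "Santiago", "category": "capital"},
    {"question": "What is the capital of Colombia?", "answer": "Bogotá", "category": "capital"},
    {"question": "What is the capital of Uruguay?", "answer": "Montevideo", "category": "capital"},
    {"question": "What is the capital of Ecuador?", "answer": "Quito", "category": "capital"},
]

mcq = build_mcq(qa_pairs[0], qa_pairs)
print(mcq["question"])
for letter, text in mcq["options"].items():
    print(f"  {letter}. {text}")
```

Drawing distractors from the same category keeps the wrong options plausible, so a model cannot answer by eliminating options of the wrong type.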
