The GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models

arXiv cs.CL / April 23, 2026


Key Points

  • The GaoYao Benchmark is introduced to address key shortcomings in existing LLM evaluation: fragmented cultural dimensions, limited language coverage from low-quality machine translation, and lack of diagnostic depth beyond rankings.
  • The benchmark covers 182.3k samples across 26 languages and 51 countries/areas, organizing tasks into three cultural layers (General Multilingual, Cross-cultural, Monocultural) and nine cognitive sub-layers.
  • The authors expand coverage to “native-quality” localized subjective tasks in 19 languages using expert localization, and they build cross-cultural test sets for 34 cultures, improving prior language/culture coverage by as much as 111%.
  • An in-depth diagnostic study evaluates 20+ flagship and compact LLMs, revealing substantial geographical performance disparities and distinct capability gaps across task types.
  • The GaoYao dataset and benchmark are released publicly on GitHub to support more reliable, culturally grounded multilingual LLM research and development.

Abstract

Evaluating the multilingual and multicultural capabilities of Large Language Models (LLMs) is essential for their global utility. However, current benchmarks face three critical limitations: (1) fragmented evaluation dimensions that often neglect deep cultural nuances; (2) insufficient language coverage in subjective tasks, which rely on low-quality machine translation; and (3) shallow analysis that lacks diagnostic depth beyond simple rankings. To address these, we introduce GaoYao, a comprehensive benchmark comprising 182.3k samples across 26 languages and 51 countries/areas. First, GaoYao proposes a unified framework categorizing evaluation tasks into three cultural layers (General Multilingual, Cross-cultural, Monocultural) and nine cognitive sub-layers. Second, we achieve native-quality expansion by leveraging experts to rigorously localize subjective benchmarks into 19 languages and by synthesizing cross-cultural test sets for 34 cultures, surpassing prior coverage by up to 111%. Third, we conduct an in-depth diagnostic analysis of 20+ flagship and compact LLMs. Our findings reveal significant geographical performance disparities and distinct capability gaps across task types, offering a reliable map for future work. We release the benchmark (https://github.com/lunyiliu/GaoYao).