Multi-lingual Functional Evaluation for Large Language Models
arXiv cs.CL / 3/13/2026
Key Points
- The authors introduce two multilingual functional benchmarks, CL-GSM Symbolic and CL-IFEval, by translating English benchmark templates into French, Spanish, Hindi, Arabic, and Yoruba to assess the practical performance and robustness of LLMs across languages (see the sketch after this list).
- They compare these benchmarks to the static multilingual benchmarks Belebele, M-GSM, and M-MMLU, finding notable performance gaps (e.g., 24%, 17%, and 18% decreases from M-GSM to CL-GSM Symbolic in English, French, and Spanish, respectively).
- They report a 15–24% drop when moving from Belebele to CL-IFEval, and only a 0.5%–3% drop between M-MMLU and CL-IFEval, highlighting how benchmark choice affects measured performance.
- The results show that model robustness varies considerably across languages, with Arabic and English displaying more consistent performance across evaluation iterations.
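The core mechanism behind a functional benchmark such as CL-GSM Symbolic is that every test item is a template whose values are re-sampled on each evaluation iteration, so a model cannot rely on memorized answers from the static original. The Python sketch below illustrates that pipeline under stated assumptions: the templates, placeholder names, sampling ranges, the `model_answer` stub, and the static-benchmark scores are all hypothetical stand-ins, not the paper's actual data or code.

```python
import random
import re
import statistics

# Hypothetical translated templates (illustrative, not the paper's items).
# Each numeric placeholder is re-sampled per iteration, so answers memorized
# from a static benchmark don't transfer.
TEMPLATES = {
    "en": "{name} has {a} apples and buys {b} more. How many apples does {name} have?",
    "fr": "{name} a {a} pommes et en achète {b} de plus. Combien de pommes a {name} ?",
    "es": "{name} tiene {a} manzanas y compra {b} más. ¿Cuántas manzanas tiene {name}?",
}

def instantiate(template: str, rng: random.Random) -> tuple[str, int]:
    """Fill a template with freshly sampled values; return (question, gold answer)."""
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    return template.format(name="Ada", a=a, b=b), a + b

def model_answer(question: str) -> int:
    # Stand-in for an LLM API call (assumption). This toy "model" just adds
    # the numbers it finds so the pipeline runs end to end.
    return sum(int(n) for n in re.findall(r"\d+", question))

def evaluate(lang: str, iterations: int = 5, items: int = 100) -> list[float]:
    """One accuracy score per iteration; every iteration re-samples all values."""
    scores = []
    for it in range(iterations):
        rng = random.Random(it)  # fresh instantiation each iteration
        correct = sum(
            model_answer(q) == gold
            for q, gold in (instantiate(TEMPLATES[lang], rng) for _ in range(items))
        )
        scores.append(correct / items)
    return scores

def report(lang: str, static_score: float) -> None:
    # Robustness = spread across iterations; drop = gap vs. a static benchmark.
    scores = evaluate(lang)
    mean = statistics.mean(scores)
    print(f"{lang}: functional={mean:.3f}  "
          f"iteration_stdev={statistics.pstdev(scores):.3f}  "
          f"drop_vs_static={static_score - mean:+.3f}")

# Illustrative static-benchmark scores (made up, not the paper's numbers).
for lang, static in {"en": 0.92, "fr": 0.88, "es": 0.87}.items():
    report(lang, static)
```

Re-sampling per iteration is what lets the authors report both the drop relative to a static benchmark and the consistency of scores across iterations, rather than a single memorization-prone number.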