
Multi-lingual Functional Evaluation for Large Language Models

arXiv cs.CL / 3/13/2026


Key Points

  • The authors introduce the multilingual functional benchmarks CL-GSM Symbolic and CL-IFEval by translating English functional benchmark templates into French, Spanish, Hindi, Arabic, and Yoruba to assess the practical performance and robustness of LLMs across languages (see the sketch after this list).
  • They compare these benchmarks against the static multilingual benchmarks Belebele, M-GSM, and M-MMLU, finding notable performance gaps (e.g., 24%, 17%, and 18% decreases from M-GSM to CL-GSM Symbolic in English, French, and Spanish, respectively).
  • They report a 15–24% drop when moving from Belebele to CL-IFEval, but only a 0.5–3% drop between M-MMLU and CL-IFEval, highlighting how strongly benchmark choice affects measured performance.
  • Model robustness across languages varies significantly, with languages such as Arabic and English showing the most consistent performance across evaluation iterations.
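
A minimal sketch of the GSM-Symbolic-style templating described above: each benchmark item is a template with symbolic slots, re-instantiated with fresh values on every evaluation run, shown here for three of the paper's six languages. The template texts, the fixed name, and the `instantiate` helper are illustrative assumptions, not the paper's actual templates.

```python
import random

# Hypothetical templates with symbolic slots ({name}, {a}, {b}); the paper's
# real templates are not reproduced here.
TEMPLATES = {
    "en": "{name} has {a} apples and buys {b} more. How many apples does {name} have now?",
    "fr": "{name} a {a} pommes et en achète {b} de plus. Combien de pommes {name} a-t-il maintenant ?",
    "es": "{name} tiene {a} manzanas y compra {b} más. ¿Cuántas manzanas tiene {name} ahora?",
}

def instantiate(lang: str, rng: random.Random) -> tuple[str, int]:
    """Fill one template with random values; return (question, gold answer)."""
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    return TEMPLATES[lang].format(name="Amina", a=a, b=b), a + b

rng = random.Random(0)
for lang in TEMPLATES:
    question, gold = instantiate(lang, rng)
    print(f"[{lang}] {question} -> {gold}")
```

Because items are regenerated on each run, a model cannot rely on memorized surface forms, which is what lets the functional scores probe robustness rather than static accuracy.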

Abstract

Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create multi-lingual functional benchmarks -- Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval) -- by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others (i.e., across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in English, French and Spanish respectively; similarly, there is a 15–24% performance drop across languages between Belebele and CL-IFEval, and only a 0.5–3% performance drop between M-MMLU and CL-IFEval). Similarly, we find that model robustness across languages varies significantly, with certain languages (e.g., Arabic, English) being the most consistently well-performing across evaluation iterations.
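
To make the abstract's comparison concrete, here is a minimal sketch of the per-language drop computation between a static benchmark and its functional counterpart. The accuracy values are invented placeholders (not the paper's results), and since the abstract does not state whether its reported drops are absolute or relative, the sketch prints both.

```python
# Placeholder accuracies for a static benchmark (e.g., M-GSM) and a functional
# counterpart (e.g., CL-GSM Symbolic); these numbers are illustrative only.
static_acc = {"en": 0.85, "fr": 0.80, "es": 0.78}
functional_acc = {"en": 0.65, "fr": 0.66, "es": 0.64}

for lang in static_acc:
    absolute = static_acc[lang] - functional_acc[lang]  # percentage-point gap
    relative = absolute / static_acc[lang]              # fraction of the static score lost
    print(f"{lang}: {absolute:.0%} absolute drop, {relative:.0%} relative drop")
```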