Multi-lingual Functional Evaluation for Large Language Models
arXiv cs.CL / 3/13/2026
Key Points
- The authors introduce two multilingual functional benchmarks, CL-GSM Symbolic and CL-IFEval, built by translating English benchmark templates into French, Spanish, Hindi, Arabic, and Yoruba to assess the practical performance and robustness of LLMs across languages (see the sketch after this list).
- They compare these against the static multilingual benchmarks Belebele, M-GSM, and M-MMLU, finding notable gaps between static and functional scores (e.g., 24%, 17%, and 18% decreases from M-GSM to CL-GSM Symbolic in English, French, and Spanish, respectively).
- They report a 15–24% drop when moving from Belebele to CL-IFEval but only a 0.5–3% drop between M-MMLU and CL-IFEval, highlighting how the choice of benchmark affects measured performance.
- The results show that model robustness across languages varies significantly, with languages like Arabic and English displaying more consistent performance across evaluation iterations.
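
To make the functional-benchmark idea concrete, here is a minimal Python sketch of how a GSM-Symbolic-style templated item might be instantiated per language and per evaluation iteration. The template strings, name lists, and the `instantiate` helper are illustrative assumptions, not the paper's actual templates or code; the point is that numeric slots are re-sampled on each iteration, so a model is scored on fresh instances rather than memorized surface forms.

```python
import random

# Hypothetical templates (assumptions for illustration, not from the paper):
# each language carries its own translation of the same symbolic item,
# with {name}, {a}, {b} as slots to be re-sampled per iteration.
TEMPLATES = {
    "en": "{name} has {a} apples and buys {b} more. How many apples does {name} have now?",
    "fr": "{name} a {a} pommes et en achète {b} de plus. Combien de pommes {name} a-t-il maintenant ?",
    "es": "{name} tiene {a} manzanas y compra {b} más. ¿Cuántas manzanas tiene {name} ahora?",
}

NAMES = {"en": ["Alice", "Bob"], "fr": ["Claire", "Luc"], "es": ["Lucía", "Mateo"]}

def instantiate(lang: str, rng: random.Random) -> tuple[str, int]:
    """Sample one concrete question and its gold answer from a template."""
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    name = rng.choice(NAMES[lang])
    question = TEMPLATES[lang].format(name=name, a=a, b=b)
    return question, a + b  # gold answer follows from the template's arithmetic

# One evaluation iteration per seed: regenerating items across seeds is
# what lets a functional benchmark measure robustness, not just one-shot
# accuracy on a fixed (and possibly memorized) test set.
for seed in range(3):
    rng = random.Random(seed)
    question, gold = instantiate("fr", rng)
    print(f"seed={seed} gold={gold} :: {question}")
```

Scoring a model on several such seeds and comparing per-language accuracy variance is the kind of across-iteration consistency the robustness finding in the last point refers to.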