Multi-lingual Functional Evaluation for Large Language Models
arXiv cs.CL / 3/13/2026
Key Points
- The authors introduce two multilingual functional benchmarks, CL-GSM Symbolic and CL-IFEval, by translating English benchmark templates into French, Spanish, Hindi, Arabic, and Yoruba to assess the practical performance and robustness of LLMs across languages (see the sketch after this list).
- They compare these benchmarks to the static multilingual benchmarks Belebele, M-GSM, and M-MMLU, finding notable performance gaps (e.g., 24%, 17%, and 18% decreases from M-GSM to CL-GSM Symbolic in English, French, and Spanish, respectively).
- They report a 15–24% drop when moving from Belebele to CL-IFEval, and only a 0.5%–3% drop between M-MMLU and CL-IFEval, highlighting how benchmark choice affects measured performance.
- The results show that model robustness varies considerably across languages, with Arabic and English displaying more consistent performance across evaluation iterations.
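The core mechanism behind a functional benchmark such as CL-GSM Symbolic is that every test item is a template whose values are re-sampled on each evaluation iteration, so a model cannot rely on memorized answers from the static original. The Python sketch below illustrates that pipeline under stated assumptions: the templates, placeholder names, sampling ranges, the `model_answer` stub, and the static-benchmark scores are all hypothetical stand-ins, not the paper's actual data or code.

```python
import random
import re
import statistics

# Hypothetical translated templates (illustrative, not the paper's items).
# Each numeric placeholder is re-sampled per iteration, so answers memorized
# from a static benchmark don't transfer.
TEMPLATES = {
    "en": "{name} has {a} apples and buys {b} more. How many apples does {name} have?",
    "fr": "{name} a {a} pommes et en achète {b} de plus. Combien de pommes a {name} ?",
    "es": "{name} tiene {a} manzanas y compra {b} más. ¿Cuántas manzanas tiene {name}?",
}

def instantiate(template: str, rng: random.Random) -> tuple[str, int]:
    """Fill a template with freshly sampled values; return (question, gold answer)."""
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    return template.format(name="Ada", a=a, b=b), a + b

def model_answer(question: str) -> int:
    # Stand-in for an LLM API call (assumption). This toy "model" just adds
    # the numbers it finds so the pipeline runs end to end.
    return sum(int(n) for n in re.findall(r"\d+", question))

def evaluate(lang: str, iterations: int = 5, items: int = 100) -> list[float]:
    """One accuracy score per iteration; every iteration re-samples all values."""
    scores = []
    for it in range(iterations):
        rng = random.Random(it)  # fresh instantiation each iteration
        correct = sum(
            model_answer(q) == gold
            for q, gold in (instantiate(TEMPLATES[lang], rng) for _ in range(items))
        )
        scores.append(correct / items)
    return scores

def report(lang: str, static_score: float) -> None:
    # Robustness = spread across iterations; drop = gap vs. a static benchmark.
    scores = evaluate(lang)
    mean = statistics.mean(scores)
    print(f"{lang}: functional={mean:.3f}  "
          f"iteration_stdev={statistics.pstdev(scores):.3f}  "
          f"drop_vs_static={static_score - mean:+.3f}")

# Illustrative static-benchmark scores (made up, not the paper's numbers).
for lang, static in {"en": 0.92, "fr": 0.88, "es": 0.87}.items():
    report(lang, static)
```

Re-sampling per iteration is what lets the authors report both the drop relative to a static benchmark and the consistency of scores across iterations, rather than a single memorization-prone number.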