Cross-Lingual Response Consistency in Large Language Models: An ILR-Informed Evaluation of Claude Across Six Languages
arXiv cs.CL / 5/1/2026
Key Points
- The paper proposes an evaluation framework informed by the Interagency Language Roundtable (ILR) for assessing cross-lingual response consistency in large language models, using ILR Skill Level Descriptions as a grounding rubric.
- It evaluates Claude (Sonnet 4.6) across six languages (English, French, Romanian, Spanish, Italian, German) with 12 semantically equivalent prompt clusters covering ILR complexity levels 1 to 3+.
- The study collects 216 responses and analyzes them with a two-layer approach that combines automated quantitative metrics and expert ILR qualitative assessments.
- Quantitative results show measurable cross-lingual divergence (for example, French responses run about 30% longer than German responses to the same prompts), with creative/affective prompts exhibiting the greatest surface differences; a sketch of this kind of length-ratio computation follows this list.
- Qualitative assessment identifies five recurring variation patterns: pragmatic disambiguation, creative/aesthetic divergence, technical terminology norms, cultural calibration gaps, and institutional referral behavior in emotional-support prompts. The authors argue this layered method captures multilingual equity issues that computational benchmarks alone miss.
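
The summary does not spell out the paper's automated metrics, but a minimal sketch of one plausible measure, per-cluster response-length ratios across languages, is shown below. The data schema (`cluster`, `lang`, `text` fields), the whitespace tokenization, and the assumption of one response per (cluster, language) pair are all illustrative choices, not the authors' actual implementation.

```python
# Hypothetical sketch of the quantitative layer: mean response length
# per language, expressed as a ratio to the cross-language average,
# computed over semantically equivalent prompt clusters.
from collections import defaultdict
from statistics import mean

LANGUAGES = ["en", "fr", "ro", "es", "it", "de"]

def length_ratios(responses):
    """Return {lang: mean length ratio vs. the per-cluster average}.

    `responses` is a list of dicts such as
    {"cluster": 3, "lang": "fr", "text": "..."} -- an assumed schema,
    with one response per (cluster, language) pair.
    """
    lengths = defaultdict(dict)  # cluster -> {lang: token count}
    for r in responses:
        # Whitespace tokenization is a crude proxy; the paper may well
        # measure characters or subword tokens instead.
        lengths[r["cluster"]][r["lang"]] = len(r["text"].split())

    ratios = defaultdict(list)
    for per_lang in lengths.values():
        if len(per_lang) < len(LANGUAGES):
            continue  # skip clusters missing any language
        cluster_mean = mean(per_lang.values())
        for lang, n in per_lang.items():
            ratios[lang].append(n / cluster_mean)

    return {lang: mean(vals) for lang, vals in ratios.items()}
```

Under this metric, a mean ratio near 1.15 for `fr` against roughly 0.88 for `de` would correspond to the roughly 30% French-over-German length gap the summary reports.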