Cross-Lingual Response Consistency in Large Language Models: An ILR-Informed Evaluation of Claude Across Six Languages

arXiv cs.CL / 5/1/2026


Key Points

  • The paper proposes an ILR-informed evaluation framework for assessing cross-lingual response consistency in large language models, using ILR Skill Level Descriptions as a grounding rubric.
  • It evaluates Claude (Sonnet 4.6) across six languages (English, French, Romanian, Spanish, Italian, German) with 12 semantically equivalent prompt clusters covering ILR complexity levels 1 to 3+.
  • The study collects 216 responses and analyzes them with a two-layer approach that combines automated quantitative metrics with expert ILR qualitative assessment (see the collection sketch after this list).
  • Quantitative results show measurable cross-lingual divergence: for example, French responses are roughly 30% longer than German responses on the same prompts, and creative/affective prompts exhibit the greatest surface differences.
  • Qualitative assessment identifies five recurring variation patterns: pragmatic disambiguation strategies, creative/aesthetic divergence, technical terminology norms, cultural calibration gaps, and institutional referral behavior in emotional support. The authors argue that this expert-judgment method deepens understanding of multilingual equity beyond purely computational benchmarks.
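
For concreteness, the design is a full factorial grid: 12 prompt clusters × 6 languages × 3 runs = 216 responses. The sketch below shows what such a collection loop might look like; the identifiers (`query_model`, `collect_responses`, the cluster naming) are illustrative assumptions, not artifacts released with the paper.

```python
import itertools

# Hypothetical experimental grid: names and cluster labels are assumptions,
# not taken from the paper's materials.
LANGUAGES = ["en", "fr", "ro", "es", "it", "de"]
PROMPT_CLUSTERS = [f"cluster_{i:02d}" for i in range(1, 13)]  # 12 clusters, ILR levels 1 to 3+
RUNS_PER_CELL = 3

def collect_responses(query_model):
    """Collect one response per (cluster, language, run) cell.

    `query_model` is a stand-in for whatever call sends the prompt in a
    given language to the model and returns its text.
    """
    responses = []
    for cluster, lang, run in itertools.product(
        PROMPT_CLUSTERS, LANGUAGES, range(RUNS_PER_CELL)
    ):
        responses.append(
            {"cluster": cluster, "language": lang, "run": run,
             "text": query_model(cluster, lang)}
        )
    return responses

# 12 prompts x 6 languages x 3 runs = 216, matching the paper's response count.
assert len(PROMPT_CLUSTERS) * len(LANGUAGES) * RUNS_PER_CELL == 216
```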

Abstract

This paper introduces a systematic evaluation framework grounded in the Interagency Language Roundtable (ILR) Skill Level Descriptions and applies it to Claude (Sonnet 4.6) across six languages: English, French, Romanian, Spanish, Italian, and German. We administer a battery of 12 semantically equivalent prompt clusters spanning ILR complexity levels 1 through 3+, collect 216 responses (12 prompts, 6 languages, 3 runs), and analyze outputs through a two-layer methodology combining automated quantitative metrics with expert ILR qualitative assessment. Quantitative analysis reveals that French responses are approximately 30% longer than German responses on identical prompts, and that creative and affective clusters show the highest cross-lingual surface divergence. Qualitative analysis, conducted by a six-language professional with 12 years of ILR/OPI assessment experience, identifies five cross-lingual variation patterns: systematic differences in pragmatic disambiguation strategies, aesthetic and literary tradition divergence in creative output, language-internal technical terminology norms, cultural calibration gaps evidenced by the absence of culture-specific content in favor of culturally neutralized templates, and language-specific institutional referral behavior in emotional support responses. We argue that ILR-informed expert judgment applied to LLM outputs constitutes a novel and underreported evaluation methodology that complements purely computational benchmarks, and that cross-lingual output variation in Claude is interpretable, domain-dependent, and consequential for equitable multilingual AI deployment.
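
As one illustration of the automated quantitative layer, a surface-length comparison could be computed as below. This is a minimal sketch assuming the response records produced by the collection loop above; the whitespace token count and the function name are assumptions, since the abstract does not specify which length metric the paper uses.

```python
from collections import defaultdict
from statistics import mean

def length_divergence(responses):
    """Mean whitespace-token length per language, plus pairwise ratios.

    `responses` is the list of dicts from the collection sketch above.
    Whitespace tokenization is a simplification; the paper does not say
    how its quantitative layer measures length.
    """
    lengths = defaultdict(list)
    for r in responses:
        lengths[r["language"]].append(len(r["text"].split()))

    means = {lang: mean(vals) for lang, vals in lengths.items()}
    # A French-to-German ratio near 1.3 would correspond to the reported
    # ~30% length gap on identical prompts.
    ratios = {(a, b): means[a] / means[b]
              for a in means for b in means if a != b}
    return means, ratios
```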