Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation

arXiv cs.CL / 4/7/2026


Key Points

  • The paper shows that in Agent-as-a-Judge evaluations the judge language is not a neutral setting: switching it can change, and in some cases invert, the ranking of judge backbones.
  • It localizes an Agent-as-a-Judge prompt stack into five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and runs 4,950 judge evaluations across 55 DevAI tasks, three agent frameworks, and six judge backbones.
  • Results indicate a strong interaction between language and judge backbone: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%) and Hindi (53.22%); no single backbone dominates across all languages.
  • Agreement between different judge backbones on requirement-level judgments is modest (Fleiss’ κ ≤ 0.231), suggesting substantial variability in how models interpret localized evaluation prompts.
  • An ablation study finds that localizing judge-side instructions, not just benchmark content, can be decisive: Hindi satisfaction drops from 42.8% to 23.2% under partial localization. The paper releases full requirement-level judgments and runtime statistics for reproducibility.
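
The 4,950 figure is consistent with a full factorial design: 55 tasks × 3 frameworks × 6 backbones × 5 languages. The sketch below enumerates such a grid. It assumes every combination is judged exactly once, which the summary implies but does not state outright. The framework and backbone names are placeholders, since the summary names only GPT-4o and Gemini.

```python
from itertools import product

# Languages are listed in the source; everything else below is a placeholder.
LANGUAGES = ["English", "Arabic", "Turkish", "Chinese", "Hindi"]
NUM_TASKS = 55  # DevAI tasks, per the source
FRAMEWORKS = [f"framework_{i}" for i in range(1, 4)]  # hypothetical names
BACKBONES = [f"backbone_{i}" for i in range(1, 7)]    # hypothetical names


def evaluation_grid():
    """Return every (task, framework, backbone, language) combination."""
    tasks = range(1, NUM_TASKS + 1)
    return list(product(tasks, FRAMEWORKS, BACKBONES, LANGUAGES))


grid = evaluation_grid()
print(len(grid))  # 4950
```

Treating language as one axis of the grid, rather than a fixed default, is what lets the study compare backbone rankings per language.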

Abstract

Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge's language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluate 55 DevAI development tasks across three developer-agent frameworks and six judge backbones, totaling 4,950 judge runs. The central finding is that backbone and language interact: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%, p < 0.001 vs. GPT-4o) and Hindi (53.22%). No single backbone dominates across all languages, and inter-backbone agreement on individual requirement judgments is modest (Fleiss' κ ≤ 0.231). A controlled ablation further shows that localizing judge-side instructions, not just benchmark content, can be decisive: Hindi satisfaction drops from 42.8% to 23.2% under partial localization. These results indicate that language should be treated as an explicit evaluation variable in agentic benchmarks. Full requirement-level judgments and runtime statistics are released for reproducibility.
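
To make the agreement statistic concrete, here is a minimal sketch of the standard Fleiss' κ formula. It assumes each requirement receives a binary verdict (satisfied or unsatisfied) from each of the six judge backbones. That setup is an assumption for illustration, and the function name and input layout are hypothetical, not the authors' code.

```python
import math


def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters that assigned item i to category j. Every item must be
    rated by the same number of raters (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Overall share of all ratings that fall into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items       # mean observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance

    if math.isclose(p_e, 1.0):
        # All ratings fall in one category; kappa is undefined here.
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Illustrative data only: rows are requirements, columns are
# [satisfied, unsatisfied] vote counts from six hypothetical judges.
example = [[6, 0], [0, 6], [3, 3], [5, 1]]
print(round(fleiss_kappa(example), 3))
```

Values near 1 indicate near-unanimous judges and values near 0 indicate agreement no better than chance, so the reported κ ≤ 0.231 sits much closer to the chance end.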