Lost in Translation: Do LVLM Judges Generalize Across Languages?
arXiv cs.CL · April 22, 2026
📰 News · Signals & Early Trends · Models & Research
Key Points
- The paper highlights that automatic evaluators (reward models) for large vision-language models are tested almost exclusively on English-centric benchmarks, leaving their cross-language generalization largely unknown.
- It introduces MM-JudgeBench, a multilingual and multimodal benchmark with 60K+ pairwise preference instances across 25 typologically diverse languages, covering both general vision-language preference evaluation and chart-centric visual-text reasoning.
- The authors also release a multilingual training set derived from MM-RewardBench (kept disjoint from the evaluation data) to enable domain adaptation.
- Evaluating 22 LVLM judges (15 open-source and 7 proprietary) reveals significant variance in cross-lingual performance and shows that model size and architecture are poor predictors of multilingual robustness (see the scoring sketch after this list).
- The results suggest that even state-of-the-art LVLM judges behave inconsistently across languages, exposing limitations of current reward modeling and motivating multilingual benchmarks.
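The evaluation protocol behind these findings is simple: a judge sees an image-grounded prompt with two candidate responses and must pick the preferred one, so per-language accuracy is the natural headline metric. Below is a minimal sketch of that scoring loop; the instance schema and the `judge_fn` interface are assumptions for illustration, not the paper's actual code or data format.

```python
# Minimal sketch (not the paper's code) of scoring an LVLM judge on
# pairwise preference instances, grouped by language. The instance
# schema and judge_fn signature are hypothetical.
from collections import defaultdict
from typing import Callable

def per_language_accuracy(
    instances: list[dict],
    judge_fn: Callable[[dict], str],  # returns "A" or "B"
) -> dict[str, float]:
    """Compute judge accuracy per language code.

    Each instance is assumed to hold an image reference, a prompt,
    two candidate responses, a gold label ("A" or "B"), and a
    language code.
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for inst in instances:
        lang = inst["language"]
        total[lang] += 1
        # The judge is correct when its pick matches the gold preference.
        if judge_fn(inst) == inst["label"]:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

if __name__ == "__main__":
    # Toy data with elided content fields; a trivial judge always picks "A".
    data = [
        {"language": "sw", "label": "A", "prompt": "...",
         "image": "...", "response_a": "...", "response_b": "..."},
        {"language": "sw", "label": "B", "prompt": "...",
         "image": "...", "response_a": "...", "response_b": "..."},
        {"language": "de", "label": "A", "prompt": "...",
         "image": "...", "response_a": "...", "response_b": "..."},
    ]
    print(per_language_accuracy(data, judge_fn=lambda inst: "A"))
    # {'sw': 0.5, 'de': 1.0}
```

Grouping by language makes the cross-lingual variance the paper reports directly visible: a judge scoring, say, 0.8 on English but 0.55 on a lower-resource language is barely better than chance on the latter, since random pairwise guessing already achieves 0.5.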
Related Articles
- Autoencoders and Representation Learning in Vision (Dev.to)
- Google Stitch 2.0: Senior-Level UI in Seconds, But Editing Still Breaks (Dev.to)
- Now Meta will track what employees do on their computers to train its AI agents (The Verge)
- Context Bloat in AI Agents (Dev.to)
- We open sourced the AI dev team that builds our product (Dev.to)