Lost in Translation: Do LVLM Judges Generalize Across Languages?

arXiv cs.CL / 4/22/2026


Key Points

  • The paper highlights that automatic evaluators (reward models) for large vision-language models are tested almost exclusively on English-centric benchmarks, leaving their cross-lingual generalization largely unknown.
  • It introduces MM-JudgeBench, a multilingual and multimodal benchmark with 60K+ pairwise preference instances across 25 typologically diverse languages, covering both general vision-language preference evaluation and chart-centric visual-text reasoning.
  • The authors also release a multilingual training set derived from MM-RewardBench (kept disjoint from the evaluation data) to enable domain adaptation.
  • Evaluating 22 LVLM judges (15 open-source and 7 proprietary) reveals significant variance in cross-lingual performance and shows that model size and architecture poorly predict multilingual robustness.
  • The results suggest that even state-of-the-art LVLM judges behave inconsistently across languages, exposing limitations of current reward modeling and motivating multilingual benchmarks.

Abstract

Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well they generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.
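
To make the judge-evaluation setup concrete, below is a minimal sketch of how per-language pairwise accuracy could be tallied for an LVLM judge on preference instances like those in MM-JudgeBench. The `PreferenceInstance` fields and the `judge_fn` interface are illustrative assumptions, not the benchmark's released data format or evaluation code.

```python
# Minimal sketch (assumed data layout, not the paper's released code):
# score an LVLM judge on pairwise preference instances, broken down by language.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class PreferenceInstance:
    image_path: str   # visual context (e.g., a chart for the OpenCQA-derived subset)
    prompt: str       # instruction/question in one of the 25 languages
    response_a: str   # candidate response A
    response_b: str   # candidate response B
    preferred: str    # gold label: "A" or "B"
    language: str     # language code of the prompt/responses

def judge_accuracy(
    instances: Iterable[PreferenceInstance],
    judge_fn: Callable[[str, str, str, str], str],  # hypothetical judge: returns "A" or "B"
) -> dict[str, float]:
    """Compute per-language pairwise preference accuracy for a judge model."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for ex in instances:
        verdict = judge_fn(ex.image_path, ex.prompt, ex.response_a, ex.response_b)
        total[ex.language] = total.get(ex.language, 0) + 1
        if verdict == ex.preferred:
            correct[ex.language] = correct.get(ex.language, 0) + 1
    return {lang: correct.get(lang, 0) / n for lang, n in total.items()}
```

Comparing the resulting per-language accuracies is one simple way to surface the kind of cross-lingual variance the paper reports; the actual evaluation protocol may differ.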