XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics

arXiv cs.CL / 17 Apr 2026


Key Points

  • The paper argues that simply averaging translation-evaluation metric scores across languages can be misleading due to cross-lingual scoring bias, where equally good translations may receive different scores depending on language.
  • It introduces XQ-MEval, a semi-automatically built dataset for nine translation directions, created by injecting MQM-defined errors into gold translations, filtering with native speakers, and generating pseudo translations with controllable quality.
  • XQ-MEval structures data into source–reference–pseudo-translation triplets to benchmark how well different translation metrics assess quality (a data-layout sketch follows this list).
  • Experiments using nine representative metrics find inconsistencies between metric averaging and human judgments, providing empirical evidence of cross-lingual scoring bias.
  • The authors further propose a normalization method based on XQ-MEval to align score distributions across languages, aiming to improve the fairness and reliability of multilingual metric evaluation.
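
The construction pipeline and triplet layout described above lend themselves to a concrete sketch. The class names, field names, and MQM severity penalty weights below (1/5/10, a common MQM scoring convention) are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class MQMError:
    """One error injected into a gold translation (hypothetical schema)."""
    span: tuple[int, int]   # character offsets of the injected error
    category: str           # MQM category, e.g. "accuracy/mistranslation"
    severity: str           # "minor", "major", or "critical"

@dataclass
class XQMEvalTriplet:
    """A source-reference-pseudo-translation triplet, as described in the paper."""
    source: str              # source-language sentence
    reference: str           # gold translation
    pseudo_translation: str  # gold translation with injected, native-speaker-filtered errors
    lang_pair: str           # translation direction, e.g. "en-de"
    errors: list[MQMError] = field(default_factory=list)

    def target_quality(self) -> float:
        # Controllable quality label: start from a perfect score and subtract
        # an MQM-style penalty per injected error (assumed weights).
        penalty = {"minor": 1.0, "major": 5.0, "critical": 10.0}
        return 100.0 - sum(penalty[e.severity] for e in self.errors)
```

Merging different numbers and severities of filtered errors into a single reference yields pseudo translations at chosen quality levels, which is what lets XQ-MEval hold quality parallel across its nine translation directions.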

Abstract

Automatic evaluation metrics are essential for building multilingual translation systems. The common practice for evaluating these systems is to average metric scores across languages, yet this practice is questionable because metrics may suffer from cross-lingual scoring bias, where translations of equal quality receive different scores across languages. This problem has not been studied systematically because no benchmark provides parallel-quality instances across languages, and expert annotation at that scale is not realistic. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions, to benchmark translation metrics. Specifically, we inject MQM-defined errors into gold translations automatically, have native speakers filter them for reliability, and merge errors to generate pseudo translations with controllable quality. These pseudo translations are then paired with their corresponding sources and references to form triplets used to assess the quality of translation metrics. Using XQ-MEval, our experiments on nine representative metrics reveal inconsistencies between cross-language averaging and human judgment and provide the first empirical evidence of cross-lingual scoring bias. Finally, we propose a normalization strategy derived from XQ-MEval that aligns score distributions across languages, improving the fairness and reliability of multilingual metric evaluation.
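
The abstract does not spell out the normalization strategy, so the following is only a minimal sketch of one plausible variant: standardizing each language's scores against pooled target statistics, relying on the fact that XQ-MEval instances have parallel quality across languages. The function name and the z-score formulation are assumptions for illustration, not the paper's method.

```python
import numpy as np

def normalize_scores(scores_by_lang: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Align a metric's score distributions across languages (sketch)."""
    # Shared target statistics, pooled over all languages.
    pooled = np.concatenate(list(scores_by_lang.values()))
    target_mu, target_sigma = pooled.mean(), pooled.std()

    normalized = {}
    for lang, scores in scores_by_lang.items():
        mu, sigma = scores.mean(), scores.std()
        # On parallel-quality data, per-language differences in mean/std
        # reflect scoring bias rather than genuine quality differences,
        # so removing them is safe by construction.
        normalized[lang] = (scores - mu) / sigma * target_sigma + target_mu
    return normalized
```

Because every language pair receives the same controlled quality levels, residual differences in per-language score statistics can be attributed to the metric itself, which is what justifies this kind of distribution alignment.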