Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains

arXiv cs.CL / 4/21/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper examines whether automatic machine-translation evaluation metrics remain reliable when tested on unseen domains, noting that WMT-trained metrics may not generalize beyond benchmark settings.
  • It introduces CD-ESA, a cross-domain, multi-annotator error-span annotation dataset with 18.8k human annotations across three language pairs, keeping annotators fixed per language pair while evaluating six translation systems across one seen news domain and two unseen technical domains.
  • The results show that metrics can appear robust to domain shift at the segment level, but this apparent robustness largely vanishes once variation in human labels is accounted for, e.g., by averaging multiple annotations per segment.
  • On the unseen chemical domain, metrics underperform relative to humans: metric–human agreement (0.78–0.83) falls well short of the human inter-annotator agreement of 0.96.
  • The authors recommend evaluating across domains by comparing metric–human agreement to inter-annotator agreement, rather than relying only on raw metric–human scores.
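
As a concrete illustration of that recommendation, the sketch below compares segment-level metric–human agreement against human inter-annotator agreement within a single domain. It is a minimal sketch under stated assumptions, not the paper's evaluation code: the choice of Kendall's tau as the agreement statistic and the names `human_scores` and `metric_scores` are illustrative.

```python
# Minimal sketch: judge a metric against the human agreement ceiling in one
# domain, rather than by its raw metric-human correlation alone.
# Kendall's tau and the input layout below are illustrative assumptions.
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau


def agreement(a, b):
    """Segment-level agreement between two score lists (Kendall's tau here)."""
    tau, _ = kendalltau(a, b)
    return tau


def evaluate_domain(human_scores, metric_scores):
    """human_scores: dict annotator_id -> list of segment scores (same segment order).
    metric_scores: list of segment scores produced by the automatic metric."""
    # Human inter-annotator agreement: mean pairwise agreement across annotators.
    pairs = combinations(human_scores.values(), 2)
    iaa = np.mean([agreement(a, b) for a, b in pairs])

    # Metric-human agreement: agreement with the averaged human judgment.
    avg_human = np.mean(list(human_scores.values()), axis=0)
    metric_human = agreement(metric_scores, avg_human)

    return {
        "inter_annotator": iaa,               # how much humans agree with each other
        "metric_human": metric_human,         # how much the metric agrees with humans
        "gap_to_humans": iaa - metric_human,  # positive gap = metric below the human ceiling
    }
```

On the paper's chemical-domain numbers, that gap would be roughly 0.96 − 0.83 ≈ 0.13 at best, which is the kind of shortfall a raw metric–human correlation on its own would not reveal.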

Abstract

Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the Workshop on Machine Translation (WMT) benchmarks, raising concerns about their robustness to unseen domains. Prior studies that analyze unseen domains vary translation systems, annotators, or evaluation conditions, confounding domain effects with human annotation noise. To address these biases, we introduce a systematic multi-annotator Cross-Domain Error-Span-Annotation dataset (CD-ESA), comprising 18.8k human error span annotations across three language pairs, where we fix annotators within each language pair and evaluate translations of the same six translation systems across one seen news domain and two unseen technical domains. Using this dataset, we first find that automatic metrics appear surprisingly robust to domain shifts at the segment level (up to 0.69 agreement), but this robustness largely disappears once we account for human label variation. Averaging annotations increases inter-annotator agreement by up to +0.11. Metrics struggle on the unseen chemical domain compared to humans (inter-annotator agreement of 0.78–0.83 vs. 0.96). We recommend comparing metric–human agreement against inter-annotator agreement, rather than comparing raw metric–human agreement alone, when evaluating across different domains.
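
The claim that averaging annotations raises inter-annotator agreement by up to +0.11 suggests a leave-one-out style measurement: each annotator is scored against a single other annotator versus against the average of the remaining annotators. The exact procedure is not spelled out above, so the sketch below is one plausible reading; the leave-one-out setup and Kendall's tau are assumptions, not the paper's documented protocol.

```python
# Minimal sketch of one reading of "averaging annotations increases
# inter-annotator agreement": agreement with a single other annotator vs.
# agreement with the average of all other annotators. The leave-one-out
# procedure and Kendall's tau are illustrative assumptions.
import numpy as np
from scipy.stats import kendalltau


def tau(a, b):
    stat, _ = kendalltau(a, b)
    return stat


def single_vs_averaged_agreement(scores):
    """scores: array of shape (n_annotators, n_segments), one row per annotator."""
    scores = np.asarray(scores, dtype=float)
    single, averaged = [], []
    for i in range(len(scores)):
        others = np.delete(scores, i, axis=0)
        single.append(tau(scores[i], others[0]))               # vs. one other annotator
        averaged.append(tau(scores[i], others.mean(axis=0)))   # vs. the averaged rest
    # The difference between these two means is the kind of improvement
    # (up to +0.11 in the abstract) that averaging annotations can yield.
    return float(np.mean(single)), float(np.mean(averaged))
```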