Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains
arXiv cs.CL / 4/21/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper examines whether automatic machine-translation evaluation metrics remain reliable when tested on unseen domains, noting that WMT-trained metrics may not generalize beyond benchmark settings.
- It introduces CD-ESA, a cross-domain, multi-annotator error-span annotation dataset with 18.8k human annotations across three language pairs, keeping annotators fixed per language pair while evaluating six translation systems across one seen news domain and two unseen technical domains.
- The results show that metrics can appear robust to domain shift at the segment level, but that this apparent robustness largely vanishes once variation in human labels is accounted for and annotations are averaged.
- On the unseen chemical domain, metrics underperform relative to humans: metric–human agreement (0.78–0.83) falls well below human inter-annotator agreement (0.96).
- The authors recommend evaluating metrics across domains by comparing metric–human agreement against human inter-annotator agreement, rather than relying on raw metric–human agreement alone (see the sketch after this list).
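A minimal sketch of that evaluation protocol is shown below. It is not the paper's code: the data are synthetic, Pearson correlation stands in for whatever agreement statistic the authors use, and all function names are illustrative assumptions. The point is the comparison itself: measure how well the metric tracks annotator-averaged human scores, then judge that number against the human-to-human ceiling.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr


def inter_annotator_agreement(human_scores: np.ndarray) -> float:
    """Mean pairwise Pearson correlation across annotators.

    human_scores: array of shape (n_annotators, n_segments).
    """
    pairs = combinations(range(human_scores.shape[0]), 2)
    corrs = [pearsonr(human_scores[i], human_scores[j])[0] for i, j in pairs]
    return float(np.mean(corrs))


def metric_human_agreement(metric_scores: np.ndarray,
                           human_scores: np.ndarray) -> float:
    """Pearson correlation between a metric and annotator-averaged human scores."""
    averaged = human_scores.mean(axis=0)  # average out per-annotator variation
    return float(pearsonr(metric_scores, averaged)[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins: 3 annotators x 200 segments, plus one metric's scores.
    true_quality = rng.normal(size=200)
    humans = true_quality + rng.normal(scale=0.2, size=(3, 200))
    metric = true_quality + rng.normal(scale=0.6, size=200)

    iaa = inter_annotator_agreement(humans)
    mha = metric_human_agreement(metric, humans)
    # The recommendation: judge the metric by its gap to the human ceiling,
    # not by the raw metric-human number in isolation.
    print(f"inter-annotator agreement: {iaa:.2f}")
    print(f"metric-human agreement:    {mha:.2f}")
    print(f"gap to human ceiling:      {iaa - mha:.2f}")
```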