FairQE: Multi-Agent Framework for Mitigating Gender Bias in Translation Quality Estimation

arXiv cs.AI / 4/25/2026

Key Points

  • Quality Estimation (QE) models for machine translation can show systematic gender bias, including favoring masculine outputs in ambiguous contexts and over-scoring gender-mismatched translations even when gender is explicitly provided.
  • FairQE is introduced as a fairness-aware, multi-agent framework that detects gender cues, generates gender-flipped translation variants, and uses these to counter bias in both gender-ambiguous and gender-explicit cases (a minimal sketch of this pipeline follows the list).
  • The framework integrates conventional QE scoring with LLM-based bias-mitigating reasoning via a dynamic, bias-aware aggregation mechanism, aiming to remain “plug-and-play” with existing QE systems.
  • Experiments across multiple gender-bias evaluation settings show consistent fairness improvements over strong QE baselines, while MQM-based meta-evaluation following the WMT 2023 Metrics Shared Task protocol indicates competitive or better overall QE performance.
  • Overall, the work suggests that gender bias in translation evaluation can be mitigated without sacrificing evaluation accuracy, improving the reliability of translation assessment.
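
The summary does not include reference code; purely as an illustration, here is a minimal Python sketch of how such a multi-agent pipeline could be wired. All names below (detect_gender_cues, flip_gender, qe_model, llm_judge) are hypothetical stand-ins, not the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class FairQEPipeline:
    detect_gender_cues: Callable[[str], List[str]]  # agent: find gender cues in the source
    flip_gender: Callable[[str], str]               # agent: build a gender-flipped variant
    qe_model: Callable[[str, str], float]           # conventional QE scorer, used unchanged
    llm_judge: Callable[[str, str, str], float]     # LLM reasoning over original vs. flipped

    def score(self, source: str, translation: str) -> float:
        base = self.qe_model(source, translation)   # plug-and-play: keep the QE backbone
        cues = self.detect_gender_cues(source)
        if not cues:                                # no gender signal: plain QE score suffices
            return base
        flipped = self.flip_gender(translation)     # counterfactual translation variant
        debiased = self.llm_judge(source, translation, flipped)
        # Fixed 50/50 blend for illustration only; the paper's dynamic
        # bias-aware aggregation is sketched after the abstract below.
        return 0.5 * base + 0.5 * debiased
```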

Abstract

Quality Estimation (QE) aims to assess machine translation quality without reference translations, but recent studies have shown that existing QE models exhibit systematic gender bias. In particular, they tend to favor masculine realizations in gender-ambiguous contexts and may assign higher scores to gender-misaligned translations even when gender is explicitly specified. To address these issues, we propose FairQE, a multi-agent-based, fairness-aware QE framework that mitigates gender bias in both gender-ambiguous and gender-explicit scenarios. FairQE detects gender cues, generates gender-flipped translation variants, and combines conventional QE scores with LLM-based bias-mitigating reasoning through a dynamic bias-aware aggregation mechanism. This design preserves the strengths of existing QE models while calibrating their gender-related biases in a plug-and-play manner. Extensive experiments across multiple gender bias evaluation settings demonstrate that FairQE consistently improves gender fairness over strong QE baselines. Moreover, under MQM-based meta-evaluation following the WMT 2023 Metrics Shared Task, FairQE achieves competitive or improved general QE performance. These results show that gender bias in QE can be effectively mitigated without sacrificing evaluation accuracy, enabling fairer and more reliable translation evaluation.
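
The abstract names a "dynamic bias-aware aggregation mechanism" but does not define it here. One plausible reading, offered purely as an assumption, is a gating weight that grows with the score gap between a translation and its gender-flipped counterpart, so that the LLM-based debiased score dominates exactly when the conventional QE model appears gender-sensitive:

```python
def bias_aware_aggregate(qe_score: float,
                         llm_score: float,
                         qe_score_flipped: float,
                         sensitivity: float = 5.0) -> float:
    """Blend a conventional QE score with an LLM-based debiased score.

    Hypothetical scheme: the QE-score gap between a translation and its
    gender-flipped variant serves as a bias proxy; a larger gap shifts more
    weight to the LLM-based score. All scores are assumed to lie in [0, 1].
    """
    bias_gap = abs(qe_score - qe_score_flipped)  # evidence of gender sensitivity
    w = min(1.0, sensitivity * bias_gap)         # dynamic, bias-aware weight
    return (1.0 - w) * qe_score + w * llm_score

# A 0.12 gap between original and flipped QE scores yields w = 0.6, so most
# of the final score comes from the debiased LLM judgment (here: 0.74).
print(bias_aware_aggregate(qe_score=0.80, llm_score=0.70, qe_score_flipped=0.68))
```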