When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning

arXiv cs.LG / 4/17/2026

Key Points

  • The paper argues that fairness evaluation in machine learning can be unreliable because different fairness metrics measure different statistical properties and may contradict each other for the same model.
  • Using face recognition as a controlled setting, the authors test model performance across multiple demographic group partitions with a variety of commonly used fairness metrics, including error-rate disparity and performance-based measures.
  • The study finds that fairness conclusions can change substantially depending on the metric selected, producing conflicting determinations about whether a model is biased.
  • To capture and quantify this inconsistency, the authors propose the Fairness Disagreement Index (FDI) and show that fairness disagreement remains high across decision thresholds and model configurations (an illustrative sketch of the idea follows this list).
  • The results suggest that reporting fairness with a single metric is insufficient for trustworthy bias assessment, and multi-metric reporting is needed for reliability in high-stakes domains.
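
This summary does not spell out how the FDI is computed, so the Python sketch below is only a hypothetical illustration of the underlying idea: compute several commonly used fairness metrics per demographic group (error-rate disparities and an accuracy gap), then score how often pairs of metrics disagree about which of two models is fairer. The function names (`per_group_rates`, `fairness_gaps`, `disagreement_index`) and the synthetic data are assumptions made for illustration, not the paper's implementation.

```python
# Minimal sketch, assuming a pairwise-ranking notion of metric disagreement;
# the paper's actual FDI definition is not given in this summary.
from itertools import combinations
import numpy as np

def per_group_rates(y_true, y_pred, groups):
    """False positive rate, false negative rate, and accuracy for each group."""
    rates = {}
    for g in np.unique(groups):
        m = groups == g
        yt, yp = y_true[m], y_pred[m]
        rates[g] = {
            "fpr": float(np.mean(yp[yt == 0])) if np.any(yt == 0) else 0.0,
            "fnr": float(np.mean(1 - yp[yt == 1])) if np.any(yt == 1) else 0.0,
            "acc": float(np.mean(yt == yp)),
        }
    return rates

def fairness_gaps(rates):
    """Max-minus-min disparity across groups for each underlying rate."""
    return {k: max(r[k] for r in rates.values()) - min(r[k] for r in rates.values())
            for k in ("fpr", "fnr", "acc")}

def disagreement_index(gaps_a, gaps_b):
    """Hypothetical FDI-style score: share of metric pairs on which the two models
    are ranked inconsistently (one metric calls model A fairer, the other model B)."""
    conflicts, comparable = 0, 0
    for m1, m2 in combinations(gaps_a, 2):
        s1 = np.sign(gaps_a[m1] - gaps_b[m1])  # which model has the smaller gap on m1
        s2 = np.sign(gaps_a[m2] - gaps_b[m2])  # which model has the smaller gap on m2
        if s1 != 0 and s2 != 0:
            comparable += 1
            conflicts += int(s1 != s2)
    return conflicts / comparable if comparable else 0.0

# Toy usage with synthetic labels, group assignments, and two models' decisions.
rng = np.random.default_rng(0)
groups = rng.integers(0, 2, 2000)
y_true = rng.integers(0, 2, 2000)
model_a = (rng.random(2000) < 0.5 + 0.1 * groups * y_true).astype(int)
model_b = (rng.random(2000) < 0.5 + 0.1 * (1 - groups) * (1 - y_true)).astype(int)

gaps_a = fairness_gaps(per_group_rates(y_true, model_a, groups))
gaps_b = fairness_gaps(per_group_rates(y_true, model_b, groups))
print(disagreement_index(gaps_a, gaps_b))  # 0.0 = all metric pairs agree, 1.0 = all conflict
```

A score of 0 means every pair of metrics picks the same model as fairer; higher values indicate the contradictory determinations the paper describes.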

Abstract

The evaluation of fairness in machine learning systems has become a central concern in high-stakes applications, including biometric recognition, healthcare decision-making, and automated risk assessment. Existing approaches typically rely on a small number of fairness metrics to assess model behaviour across group partitions, implicitly assuming that these metrics provide consistent and reliable conclusions. However, different fairness metrics capture distinct statistical properties of model performance and may therefore produce conflicting assessments when applied to the same system. In this work, we investigate the consistency of fairness evaluation by conducting a systematic multi-metric analysis of demographic bias in machine learning models. Using face recognition as a controlled experimental setting, we evaluate model performance across multiple group partitions under a range of commonly used fairness metrics, including error-rate disparities and performance-based measures. Our results demonstrate that fairness assessments can vary significantly depending on the choice of metrics, leading to contradictory conclusions regarding model bias. To quantify this phenomenon, we introduce the Fairness Disagreement Index (FDI), a measure designed to capture the degree of inconsistency across fairness metrics. We further show that disagreement remains high across thresholds and model configurations. These findings highlight a critical limitation in current fairness evaluation practices and suggest that single-metric reporting is insufficient for reliable bias assessment.
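
To complement the abstract's claim that disagreement persists across decision thresholds, the following toy sketch (again an assumption-laden illustration under synthetic data, not the authors' experimental protocol) sweeps a verification threshold over simulated similarity scores and counts how often two error-rate metrics point at different worst-off groups.

```python
# Illustrative threshold sweep: different fairness metrics can flag different
# groups as disadvantaged depending on the decision threshold. All data here
# are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
groups = rng.integers(0, 2, n)                 # two demographic groups
labels = rng.integers(0, 2, n)                 # genuine (1) vs impostor (0) pairs
# Synthetic similarity scores with a slight group-dependent shift.
scores = rng.normal(loc=labels + 0.15 * groups, scale=0.8)

def worst_group(metric, threshold):
    """Group with the worst value of the given error metric at this threshold."""
    vals = {}
    for g in (0, 1):
        m = groups == g
        pred = scores[m] >= threshold
        if metric == "fpr":
            vals[g] = np.mean(pred[labels[m] == 0])   # false positive rate
        else:
            vals[g] = np.mean(~pred[labels[m] == 1])  # false negative rate
    return max(vals, key=vals.get)

thresholds = np.linspace(0.2, 1.2, 21)
flips = sum(worst_group("fpr", t) != worst_group("fnr", t) for t in thresholds)
print(f"{flips}/{len(thresholds)} thresholds where FPR and FNR disagree on the worst-off group")
```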