FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition

arXiv cs.LG / 4/22/2026


Key Points

  • The paper argues that standard loss-based evaluation can miss performance shifts affecting specific subgroups, motivating better fairness auditing methods.
  • It introduces FairTree, an algorithm for subgroup fairness auditing that can directly handle continuous, categorical, and ordinal features without discretization.
  • FairTree extends prior auditing ideas by decomposing performance disparities into systematic bias and variance, enabling a clearer interpretation of why subgroup performance changes.
  • The authors propose two variants—a permutation-based method and a fluctuation test—and simulation results show both have acceptable false-positive rates, with the fluctuation approach achieving higher power.
  • They illustrate the approach on the UCI Adult Census dataset and present the framework as suitable for statistically evaluating model performance and fairness, even on relatively small datasets.
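To make the permutation idea concrete, here is a minimal sketch of a generic max-selected split test on a continuous covariate. It is not the authors' FairTree implementation: the function names `max_split_statistic` and `permutation_pvalue`, the mean-loss gap statistic, and the `min_size` guard are all assumptions chosen for illustration. It only shows how per-observation losses can be tested against a continuous feature without binning it first.

```python
import numpy as np

def max_split_statistic(losses, covariate, min_size=20):
    """Largest absolute gap in mean loss over all binary splits of a covariate.

    Hypothetical helper: every observed cut point is scanned, so the
    covariate is never discretized in advance. Assumes min_size >= 1.
    """
    losses = np.asarray(losses, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    order = np.argsort(covariate, kind="stable")
    sorted_losses = losses[order]
    sorted_cov = covariate[order]
    n = len(sorted_losses)
    cumsum = np.cumsum(sorted_losses)
    total = cumsum[-1]
    best = 0.0
    for k in range(min_size, n - min_size + 1):
        # Do not split between tied covariate values.
        if sorted_cov[k - 1] == sorted_cov[k]:
            continue
        left_mean = cumsum[k - 1] / k
        right_mean = (total - cumsum[k - 1]) / (n - k)
        best = max(best, abs(left_mean - right_mean))
    return best

def permutation_pvalue(losses, covariate, n_perm=500, min_size=20, seed=0):
    """Permutation p-value for 'loss does not depend on the covariate'.

    Shuffling the losses breaks any link to the covariate, so the permuted
    statistics approximate the null distribution of the maximum gap.
    """
    rng = np.random.default_rng(seed)
    losses = np.asarray(losses, dtype=float)
    observed = max_split_statistic(losses, covariate, min_size)
    exceed = 0
    for _ in range(n_perm):
        permuted = rng.permutation(losses)
        if max_split_statistic(permuted, covariate, min_size) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Because the maximum over all cut points is recomputed for each permutation, the p-value already accounts for having searched over splits. A fluctuation test, as described in the paper, would avoid this resampling loop, but its details are not given here and are not sketched.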

Abstract

The evaluation of machine learning models typically relies on performance metrics based on loss functions, which risk overlooking changes in performance in relevant subgroups. Auditing tools such as SliceFinder and SliceLine were proposed to detect such groups, but usually have conceptual disadvantages, such as the inability to directly address continuous covariates. In this paper, we introduce FairTree, a novel algorithm adapted from psychometric invariance testing. Unlike SliceFinder and related algorithms, FairTree directly handles continuous, categorical, and ordinal features without discretization. It further decomposes performance disparities into systematic bias and variance, allowing a categorization of changes in algorithm performance. We propose and evaluate two variations of the algorithm: a permutation-based approach, which is conceptually closer to SliceFinder, and a fluctuation test. Through simulation studies that include a direct comparison with SliceLine, we demonstrate that both approaches have a satisfactory rate of false-positive results, but that the fluctuation approach has relatively higher power. We further illustrate the method on the UCI Adult Census dataset. The proposed algorithms provide a flexible framework for the statistical evaluation of the performance and aspects of fairness of machine learning models in a wide range of applications, even with relatively small datasets.
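The abstract does not spell out how the decomposition is computed. As one plausible reading, the sketch below uses the standard identity that mean squared error equals the squared mean residual (systematic bias) plus the residual variance, and compares both parts inside and outside a subgroup. The names `decompose_error` and `subgroup_gap` are hypothetical, and squared-error loss is an assumption.

```python
import numpy as np

def decompose_error(y_true, y_pred):
    """Split mean squared error into squared bias and residual variance.

    Uses the identity mean(r**2) == mean(r)**2 + var(r) for residuals r,
    with the population variance (ddof=0) so the identity holds exactly.
    """
    residuals = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    bias = residuals.mean()
    return {
        "mse": float(np.mean(residuals ** 2)),
        "bias_sq": float(bias ** 2),
        "variance": float(residuals.var()),
    }

def subgroup_gap(y_true, y_pred, mask):
    """Difference of each error component: subgroup minus everyone else.

    `mask` is a boolean array marking subgroup membership (an assumed input).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    inside = decompose_error(y_true[mask], y_pred[mask])
    outside = decompose_error(y_true[~mask], y_pred[~mask])
    return {key: inside[key] - outside[key] for key in inside}
```

A large gap in `bias_sq` would point to predictions that are systematically shifted for the subgroup, while a large gap in `variance` would point to noisier predictions there. This mirrors the kind of categorization the abstract describes, without claiming to reproduce it.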