A Framework for Exploring and Disentangling Intersectional Bias: A Case Study in Fetal Ultrasound

arXiv cs.LG / 5/6/2026


Key Points

  • The paper argues that in image-based medical AI tasks like fetal ultrasound, performance gaps may persist even with adequate demographic representation because accuracy depends heavily on image quality.
  • It proposes a structured framework to detect intersectional bias by combining unsupervised slice discovery, factor-wise analysis, and targeted intersectional evaluation to disentangle demographic, clinical, and acquisition influences.
  • Using 94,000+ fetal ultrasound images, the study analyzes bias in both a state-of-the-art deep learning model and the clinical-standard Hadlock regression formula (see the sketch after this list), finding that pixel spacing (PS) is a consistent driver of performance differences.
  • The authors report that higher PS can yield improvements of up to ~24% for certain subgroups, but because PS is often adjusted for high maternal BMI or low gestational age (GA), the apparent effect carries a substantial risk of confounding.
  • Their intersectional results suggest some of the PS-related signal is explained by GA, while PS improvements remain across BMI groups, underscoring the need for acquisition-aware and interaction-aware fairness evaluation in medical AI.
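
The summary does not say which Hadlock variant serves as the clinical baseline; the minimal sketch below assumes the widely cited four-parameter Hadlock (1985) regression on biparietal diameter (BPD), head circumference (HC), abdominal circumference (AC), and femur length (FL).

```python
def hadlock_efw(bpd_cm: float, hc_cm: float, ac_cm: float, fl_cm: float) -> float:
    """Estimated fetal weight in grams from the four-parameter Hadlock (1985)
    regression. Assumption: the paper may benchmark a different Hadlock
    variant; this is the commonly used BPD/HC/AC/FL form, inputs in cm."""
    log10_efw = (1.3596
                 + 0.0064 * hc_cm
                 + 0.0424 * ac_cm
                 + 0.174 * fl_cm
                 + 0.00061 * bpd_cm * ac_cm
                 - 0.00386 * ac_cm * fl_cm)
    return 10.0 ** log10_efw
```

For typical term biometry (BPD 9.3, HC 34, AC 35, FL 7.3, all in cm) this gives roughly 3.5 kg, a plausible birth weight. Because the formula depends entirely on biometric measurements taken from the image, degraded image quality propagates directly into its predictions, which is why the paper evaluates it alongside the DL model.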

Abstract

Bias in medical AI is often framed as a problem of representation. However, in image-based tasks such as fetal ultrasound, performance disparities can arise even when representation is adequate, because predictive accuracy depends strongly on image quality. Image quality is shaped by acquisition conditions and operator expertise, as well as patient-dependent factors such as maternal body mass index (BMI), all of which may correlate with sensitive demographic features. Consequently, observed disparities may reflect the combined influence of demographic, clinical, and acquisition-related factors rather than data imbalance alone, and may obscure underlying interaction or confounding effects. We propose a structured framework to explore and detect intersectional bias, combining unsupervised slice discovery, systematic factor-wise analysis, and targeted intersectional evaluation. In a case study of over 94,000 ultrasound images for fetal weight estimation, we analyze bias in a state-of-the-art deep learning (DL) model and the clinical standard Hadlock, a regression formula using biometric measurements. Pixel spacing (PS), a parameter considered suboptimal in current acquisition protocols, emerged as a consistent driver of performance differences, with higher PS associated with improvements of up to 24% in selected subgroups for both models. Because PS is often adapted in cases of high BMI or low gestational age (GA), this effect carries a substantial risk of confounding. Our intersectional analysis revealed that part of the PS-associated signal is explained by GA, while PS-related improvements persist across BMI strata, highlighting the importance of acquisition-aware and interaction-aware evaluation in medical AI fairness research.
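
The three-step framework the abstract describes can be made concrete with a small sketch: first discover underperforming slices without supervision, then compare a subgroup error metric per factor and per factor pair, so that effects confounded in the marginal view (such as PS being adjusted for low GA) become visible in the pairwise view. Everything below is illustrative, not the authors' implementation: KMeans stands in for whatever slice-discovery method the paper uses, and the column names (`ps_bin`, `ga_bin`, `bmi_bin`, `abs_pct_error`) and the mean-absolute-percentage-error metric are assumptions.

```python
import itertools

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def discover_slices(embeddings: np.ndarray, errors: np.ndarray,
                    n_slices: int = 20) -> pd.DataFrame:
    """Step 1, unsupervised slice discovery: cluster image embeddings and
    rank clusters by mean error to surface underperforming subgroups.
    KMeans is a stand-in; the summary does not name the discovery method."""
    labels = KMeans(n_clusters=n_slices, n_init=10,
                    random_state=0).fit_predict(embeddings)
    return (pd.DataFrame({"slice": labels, "error": errors})
            .groupby("slice")["error"]
            .agg(["mean", "count"])
            .sort_values("mean", ascending=False))


def factor_report(df: pd.DataFrame, factors: list[str]) -> pd.DataFrame:
    """Steps 2 and 3, factor-wise then intersectional evaluation: mean error
    per factor level, then per pair of factor levels. `df` is assumed to hold
    one row per image with pre-binned factor columns and a precomputed
    `abs_pct_error` column (|predicted - true| / true weight)."""
    rows = []
    for f in factors:  # marginal, factor-wise slices
        for level, g in df.groupby(f):
            rows.append({"factors": f, "levels": str(level),
                         "n": len(g), "mean_error": g["abs_pct_error"].mean()})
    for f1, f2 in itertools.combinations(factors, 2):  # intersectional slices
        for (l1, l2), g in df.groupby([f1, f2]):
            rows.append({"factors": f"{f1} x {f2}", "levels": f"{l1} / {l2}",
                         "n": len(g), "mean_error": g["abs_pct_error"].mean()})
    return pd.DataFrame(rows).sort_values("mean_error", ascending=False)


# Usage (hypothetical): report = factor_report(df, ["ps_bin", "ga_bin", "bmi_bin"])
```

Reading the two views together is the point: if high-PS slices look strong marginally but the advantage shrinks within GA strata, part of the PS signal is really a GA effect; if the advantage persists within BMI strata, as the authors report, the acquisition parameter itself matters.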