Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities

arXiv cs.AI / 4/17/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The perspective argues that bias contributing to healthcare disparities can originate at the earliest stages of biomedical research, especially during molecular-level data collection and research prioritization.
  • An analysis of 4,719 PubMed-indexed omics studies (2015–2024) finds that only a small fraction report ancestry or ethnicity, and reported demographic data shows significant bias.
  • Examination of major training datasets (e.g., CellxGene and GEO) reveals substantial population bias, with data from European ancestry strongly dominating.
  • As biomedical foundation models reuse pretrained base models for many downstream tasks, these early dataset biases may be perpetuated or amplified, potentially creating cascading inequities that regulators alone cannot fully fix.
  • The authors propose community-wide principles—Provenance, Openness, and Evaluation Transparency—to improve equity and robustness in biomedical AI.

Abstract

Healthcare disparities persist across socioeconomic boundaries and are often attributed to unequal access to screening, diagnostics, and therapeutics. However, this perspective highlights that when studies focus on molecular-level data, critical biases can emerge much earlier, during data collection and research prioritization, long before clinical implementation. A vast number of studies collect omics data, yet the demographic information associated with these datasets is often not reported, and when it is reported, it shows substantial bias. An automated analysis of 4,719 PubMed-indexed omics publications from 2015 to 2024 reveals that only a small fraction report ancestry or ethnicity information, with ancestry reporting improving slightly over time. Analysis of large-scale datasets commonly used for model training, such as CellxGene and GEO, reveals substantial population bias, with European-ancestry data dominating. As biomedical foundation models become central to biomedical discovery, under a paradigm in which base models are pretrained on large datasets and reused across many downstream tasks, they risk perpetuating or amplifying these early-stage biases, leading to cascading inequities that regulatory interventions cannot fully reverse. We propose a community-wide focus on three foundational principles, Provenance, Openness, and Evaluation Transparency, to improve equity and robustness in biomedical AI. This approach aims to foster biomedical innovation that more effectively serves underserved populations and improves health outcomes.
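To make the "automated analysis" concrete, a study like this might screen each publication's text for ancestry- or ethnicity-related terms and compute the fraction of papers that report any. The sketch below is a minimal, hypothetical version of that idea: the term list, regex matching, and toy corpus are illustrative assumptions, not the authors' actual pipeline (which analyzed 4,719 PubMed-indexed omics publications).

```python
import re

# Illustrative term list (an assumption, not the paper's lexicon): flag
# whether a text mentions ancestry/ethnicity at all.
ANCESTRY_TERMS = re.compile(
    r"\b(ancestry|ethnicity|ethnic|race|racial|"
    r"european|african|asian|hispanic|latino)\b",
    re.IGNORECASE,
)

def reports_ancestry(abstract: str) -> bool:
    """Return True if the abstract mentions any ancestry-related term."""
    return bool(ANCESTRY_TERMS.search(abstract))

def reporting_rate(abstracts: list[str]) -> float:
    """Fraction of abstracts that mention ancestry/ethnicity."""
    if not abstracts:
        return 0.0
    return sum(reports_ancestry(a) for a in abstracts) / len(abstracts)

# Toy corpus standing in for PubMed-indexed omics abstracts.
corpus = [
    "We profile the transcriptome of 500 tumor samples.",
    "RNA-seq of a cohort of European ancestry reveals novel variants.",
    "Proteomic analysis of plasma; self-reported ethnicity was recorded.",
]
print(f"{reporting_rate(corpus):.2f}")  # 2 of 3 abstracts match
```

A real pipeline would fetch records via the NCBI E-utilities API and likely combine keyword screening with manual validation, since simple regexes both miss paraphrased reporting and match incidental mentions.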