Who Trains Matters: Federated Learning under Enrollment and Participation Selection Biases

arXiv cs.LG · April 30, 2026


Key Points

  • The paper shows that federated learning can suffer from two distinct selection biases—enrollment bias (who is ever eligible/reachable) and participation bias (who actually participates each round)—which can break the representativeness assumption behind FL training.
  • It formalizes federated learning under a two-stage client selection model and introduces FedIPW, an inverse-probability-weighted aggregation method that recovers target-population mean updates under standard ignorability/positivity assumptions (a minimal sketch of the weighting idea follows this list).
  • Since covariates for non-enrolled clients are often missing, it also proposes a limited-information aggregate-calibration extension that reweights enrolled clients using known target-population summaries to partially correct enrollment bias.
  • The authors provide an algorithm-agnostic optimization analysis under residual weighting error and show that incomplete selection correction can leave a persistent (non-vanishing) bias floor.
  • Experiments with synthetic federated logistic regression confirm the objective mismatch predicted by theory and demonstrate that enrollment correction reduces target-population error under two-stage selection.
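
To make the FedIPW idea concrete, here is a minimal NumPy sketch of self-normalized inverse-probability-weighted aggregation over a single round. The function name, the factorization of the selection probability into enrollment and participation stages, and the toy numbers are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fed_ipw_aggregate(updates, enroll_prob, part_prob):
    """Self-normalized (Hajek-style) IPW mean of observed client updates.

    updates     : (m, d) local updates from the m clients that participated
    enroll_prob : (m,)  stage-1 probability that each client is enrolled
    part_prob   : (m,)  stage-2 probability of participating, given enrolled
    """
    pi = enroll_prob * part_prob          # overall two-stage selection prob.
    assert np.all(pi > 0), "positivity: every selection prob. must be > 0"
    w = 1.0 / pi                          # inverse-probability weights
    w = w / w.sum()                       # normalize so the weights sum to 1
    return w @ updates                    # reweighted mean update

# Toy round: 4 participating clients with 3-dimensional updates.
rng = np.random.default_rng(0)
updates = rng.normal(size=(4, 3))
agg = fed_ipw_aggregate(
    updates,
    enroll_prob=np.array([0.9, 0.5, 0.8, 0.3]),
    part_prob=np.array([0.6, 0.4, 0.7, 0.5]),
)
print(agg)  # rarely selected clients are upweighted in the aggregate
```

Under the stated ignorability and positivity assumptions, upweighting rarely selected clients makes the aggregate an estimate of the target-population mean update rather than of the selected-sample mean.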

Abstract

Federated learning (FL) trains a shared model from updates contributed by distributed clients, often implicitly assuming that contributing clients are representative of the target population. In practice, this representativeness assumption can fail at two distinct stages, inducing selection bias. First, eligibility rules such as device constraints, software requirements, or user consent determine which clients are ever enrolled and reachable for training, inducing enrollment bias. Second, among enrolled clients, user and system factors such as battery state, network status, and local time determine which clients participate in each communication round, inducing participation bias. Although existing work has largely addressed round-level participation bias, it has paid far less attention to population-level enrollment bias, which can induce a persistent mismatch between the training objective and the target-population objective. We formalize FL under a two-stage selection model and derive FedIPW, an inverse-probability-weighted aggregation scheme that recovers the target-population mean update under standard ignorability and positivity assumptions. Because client-level covariates are often unavailable for non-enrolled clients, we also introduce a limited-information aggregate-calibration extension that uses known target-population summaries to reweight the enrolled sample, partially correcting enrollment bias. We further provide an algorithm-agnostic optimization analysis under residual weighting error and show that incomplete selection correction can induce a non-vanishing bias floor. Finally, experiments on synthetic federated logistic regression validate the predicted objective mismatch and show that enrollment correction reduces target-population error under two-stage selection.
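
When only population-level summaries such as covariate means are known, one common way to realize the kind of aggregate calibration described in the abstract is exponential tilting (entropy balancing): choose client weights proportional to exp(lam @ x_i) so that the weighted covariate mean of the enrolled sample matches the known target-population mean. The sketch below is a generic version of that idea with assumed names and a plain gradient solver on the convex dual; the paper's exact estimator may differ.

```python
import numpy as np

def calibrate_weights(X, target_means, iters=500, lr=0.5):
    """Exponential-tilting weights on enrolled clients so that the
    weighted covariate mean matches a known target-population mean.

    X            : (n, d) covariates of the n enrolled clients
    target_means : (d,)   known target-population covariate means
    """
    lam = np.zeros(X.shape[1])        # dual variables of the moment constraint
    for _ in range(iters):
        w = np.exp(X @ lam)
        w /= w.sum()                  # normalized calibration weights
        gap = w @ X - target_means    # residual of the moment condition
        lam -= lr * gap               # gradient step on the convex dual
    return w

# Enrolled sample whose covariates are shifted relative to the population.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2)) + 0.5
w = calibrate_weights(X, target_means=np.zeros(2))
print(w @ X)  # ~ [0, 0]: calibration removes the covariate-mean shift
```

In a two-stage pipeline, one would then use these calibration weights in place of the unknown enrollment probabilities when aggregating client updates, which is why the correction is only partial: it matches the supplied summaries, not the full selection mechanism.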