Overcoming Selection Bias in Statistical Studies With Amortized Bayesian Inference

arXiv stat.ML / 4/21/2026


Key Points

  • The paper addresses selection bias in statistical studies, where sample inclusion depends on variables related to the quantities of interest, distorting both estimates and uncertainty quantification.
  • It proposes a bias-aware simulation-based inference framework that embeds the selection mechanism into the generative simulator to enable amortized Bayesian inference without requiring tractable likelihoods.
  • Unlike simulation-based inference methods that assume missingness at random, the approach is designed to handle cases where selection depends on unobserved outcomes or covariates.
  • The method provides diagnostics to detect discrepancies between simulated and observed data and to check posterior calibration, allowing researchers to test whether bias is present.
  • Experiments across three statistical applications with different selection mechanisms show the framework produces well-calibrated, debiased posteriors, including scenarios where likelihood-based corrections fail.
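The core idea in the second bullet can be made concrete with a toy sketch. Assuming a hypothetical prevalence-estimation setup (not from the paper): each individual's binary outcome drives their inclusion probability, and this outcome-dependent selection step is written directly into the simulator that generates training pairs for amortized inference. The prior, selection probabilities, and summary statistic below are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_with_selection(n_datasets=1000, n_obs=200):
    """Generate (theta, summary) training pairs where the selection
    mechanism is part of the generative simulator itself.

    Hypothetical setup: theta is a prevalence, each individual's
    outcome is y ~ Bernoulli(theta), and positives are more likely
    to enter the sample, i.e. selection depends on the (possibly
    unobserved) outcome and is not missing at random.
    """
    thetas, summaries = [], []
    for _ in range(n_datasets):
        theta = rng.beta(2, 5)             # prior over prevalence
        y = rng.random(n_obs) < theta      # latent outcomes
        # Selection probability depends on the outcome itself:
        p_select = np.where(y, 0.9, 0.3)
        sample = y[rng.random(n_obs) < p_select]
        thetas.append(theta)
        # Naive summary of the *selected* sample only:
        summaries.append(sample.mean() if sample.size else 0.0)
    return np.array(thetas), np.array(summaries)

thetas, summaries = simulate_with_selection()
# The selected-sample mean systematically overstates prevalence:
print(summaries.mean() - thetas.mean())
```

A neural posterior estimator trained on such (theta, summary) pairs sees the selection distortion during training and can therefore invert it at inference time, which is the sense in which bias correction becomes a simulation problem.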

Abstract

Selection bias arises when the probability that an observation enters a dataset depends on variables related to the quantities of interest, leading to systematic distortions in estimation and uncertainty quantification. For example, in epidemiological or survey settings, individuals with certain outcomes may be more likely to be included, resulting in biased prevalence estimates with potentially substantial downstream impact. Classical corrections, such as inverse-probability weighting or explicit likelihood-based models of the selection process, rely on tractable likelihoods, which limits their applicability in complex stochastic models with latent dynamics or high-dimensional structure. Simulation-based inference enables Bayesian analysis without tractable likelihoods but typically assumes missingness at random and thus fails when selection depends on unobserved outcomes or covariates. Here, we develop a bias-aware simulation-based inference framework that explicitly incorporates selection into neural posterior estimation. By embedding the selection mechanism directly into the generative simulator, the approach enables amortized Bayesian inference without requiring tractable likelihoods. This recasting of selection bias as part of the simulation process allows us to both obtain debiased estimates and explicitly test for the presence of bias. The framework integrates diagnostics to detect discrepancies between simulated and observed data and to assess posterior calibration. The method recovers well-calibrated posterior distributions across three statistical applications with diverse selection mechanisms, including settings in which likelihood-based approaches yield biased estimates. These results recast the correction of selection bias as a simulation problem and establish simulation-based inference as a practical and testable strategy for parameter estimation under selection bias.
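The calibration diagnostic mentioned in the abstract can be illustrated with a minimal simulation-based calibration (SBC) check. As a stand-in for a learned neural posterior, this sketch uses an analytically tractable conjugate Beta-Binomial posterior; the prior, data model, and sample sizes are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sbc_ranks(n_trials=500, n_obs=50, n_post=99):
    """Toy simulation-based calibration (SBC) check.

    For each trial: draw theta from the prior, simulate data,
    draw posterior samples, and record the rank of the true theta
    among them. If the posterior is well calibrated, the ranks are
    uniform on {0, ..., n_post}.
    """
    ranks = []
    for _ in range(n_trials):
        theta = rng.beta(2, 5)                       # prior draw
        k = rng.binomial(n_obs, theta)               # simulated data
        # Conjugate posterior as a stand-in for a learned one:
        post = rng.beta(2 + k, 5 + n_obs - k, size=n_post)
        ranks.append(int((post < theta).sum()))
    return np.array(ranks)

ranks = sbc_ranks()
# Uniform ranks imply a mean rank near n_post / 2 = 49.5:
print(ranks.mean())
```

A skewed or U-shaped rank histogram would flag miscalibration; in the paper's framework, an analogous check (together with comparing simulated against observed data) is what lets researchers test whether residual selection bias is present.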