Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare

arXiv stat.ML / 4/16/2026


Key Points

  • The paper argues that healthcare ML evaluations often miss patient-level instability in risk estimates, even when aggregate metrics and model/data setup are unchanged.
  • It shows that for overparameterized models, randomness from optimization and initialization can produce materially different predictions for the same patient, creating procedural arbitrariness.
  • The authors propose two diagnostics—empirical prediction interval width (ePIW) for continuous risk variability and empirical decision flip rate (eDFR) for threshold-based treatment instability.
  • Experiments on simulated data and the GUSTO-I clinical dataset find that flexible ML models can show instability from optimization/initialization comparable to full training-data resampling, with neural networks more unstable than logistic regression.
  • The study concludes that instability near clinical decision thresholds can change recommendations and should be included in routine clinical model validation.

Abstract

In healthcare, predictive models increasingly inform patient-level decisions, yet little attention is paid to the variability in individual risk estimates and its impact on treatment decisions. For overparameterized models, now standard in machine learning, a substantial source of variability often goes undetected. Even when the data and model architecture are held fixed, randomness introduced by optimization and initialization can lead to materially different risk estimates for the same patient. This problem is largely obscured by standard evaluation practices, which rely on aggregate performance metrics (e.g., log-loss, accuracy) that are agnostic to individual-level stability. As a result, models with indistinguishable aggregate performance can nonetheless exhibit substantial procedural arbitrariness, which can undermine clinical trust. We propose an evaluation framework that quantifies individual-level prediction instability using two complementary diagnostics: empirical prediction interval width (ePIW), which captures variability in continuous risk estimates, and empirical decision flip rate (eDFR), which measures instability in threshold-based clinical decisions. We apply these diagnostics to simulated data and the GUSTO-I clinical dataset. Across the settings studied, we find that for flexible machine-learning models, randomness arising solely from optimization and initialization can induce individual-level variability comparable to that produced by resampling the entire training dataset. Neural networks exhibit substantially greater instability in individual risk predictions than logistic regression models. Risk estimate instability near clinically relevant decision thresholds can alter treatment recommendations. These findings suggest that stability diagnostics should be incorporated into routine model validation to assess clinical reliability.
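To make the two diagnostics concrete, here is a minimal sketch of how they could be computed from a matrix of risk predictions collected over repeated model fits (same data and architecture, different random seeds). The exact definitions used in the paper may differ; the quantile levels, the 0.5 decision threshold, and the majority-vote reference for eDFR are assumptions for illustration, and the predictions here are simulated rather than produced by real retrained models.

```python
import numpy as np

def epiw(preds, lower=0.025, upper=0.975):
    """Empirical prediction interval width per patient.

    preds: (K, n) array of predicted risks from K repeated fits.
    Returns the width of the empirical [lower, upper] quantile
    interval of each patient's risk estimates across fits.
    """
    lo = np.quantile(preds, lower, axis=0)
    hi = np.quantile(preds, upper, axis=0)
    return hi - lo

def edfr(preds, threshold=0.5):
    """Empirical decision flip rate per patient (illustrative definition):
    the fraction of fits whose thresholded decision disagrees with the
    majority decision across all K fits.
    """
    decisions = preds >= threshold            # (K, n) boolean decisions
    majority = decisions.mean(axis=0) >= 0.5  # per-patient majority vote
    return (decisions != majority).mean(axis=0)

# Simulated stand-in for K = 50 retrained models on n = 200 patients:
# a fixed "true" risk per patient, jittered by seed-to-seed noise.
rng = np.random.default_rng(0)
base = rng.uniform(0.1, 0.9, size=200)
preds = np.clip(base + rng.normal(0.0, 0.05, size=(50, 200)), 0.0, 1.0)

print("median ePIW:", np.median(epiw(preds)))
print("patients with any decision flips:", int((edfr(preds) > 0).sum()))
```

Patients whose baseline risk sits near the threshold are exactly the ones with nonzero eDFR, which is the paper's point about instability near clinical decision thresholds changing recommendations.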