An empirical evaluation of the risks of AI model updates using clinical data: stability, arbitrariness, and fairness

arXiv cs.AI / 4/28/2026

Key Points

  • The study addresses how AI/ML models used in clinical decision-making can degrade when training data become stale, especially due to demographic or behavioral changes.
  • Using four publicly available U.S.-based Type 1 Diabetes datasets with high-resolution continuous glucose monitoring (CGM) data, the authors evaluate how model update strategies can introduce risks that accuracy improvements alone do not capture.
  • The evaluation shows that model updates can harm stability, causing predictions to “flip” for a large number of cases after an update, and can increase arbitrariness in prediction behavior (see the sketch after this list).
  • The authors further assess fairness impacts, finding that updates can worsen accuracy equity and disrupt error-rate balance across sociodemographic subpopulations.
  • They propose a continuous monitoring framework with multiple dimensions to detect stability, arbitrariness, and fairness failures, arguing it is essential for trustworthy clinical decision support.
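
To make the stability and arbitrariness dimensions concrete, here is a minimal sketch of how such checks could be computed. The function names, the flip-rate definition, and the disagreement measure across retrained models are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical monitoring checks: prediction "flips" between model versions
# (stability) and disagreement across equally plausible retrained models
# (arbitrariness). Names and formulas are illustrative assumptions.
import numpy as np

def flip_rate(preds_old: np.ndarray, preds_new: np.ndarray) -> float:
    """Fraction of cases whose binary prediction changes after an update."""
    return float(np.mean(preds_old != preds_new))

def arbitrariness(pred_matrix: np.ndarray) -> float:
    """Mean per-case disagreement across k retrained models.

    pred_matrix has shape (k_models, n_cases); a case predicted identically
    by all k models contributes 0, a 50/50 split contributes the maximum.
    """
    pos_frac = pred_matrix.mean(axis=0)           # fraction of models predicting 1, per case
    disagreement = 2 * pos_frac * (1 - pos_frac)  # peaks when models split 50/50
    return float(disagreement.mean())

# Toy usage with random stand-in predictions.
rng = np.random.default_rng(0)
old = rng.integers(0, 2, size=1000)
new = rng.integers(0, 2, size=1000)
print(f"flip rate after update: {flip_rate(old, new):.2%}")

ensemble = rng.integers(0, 2, size=(10, 1000))    # 10 retrained model variants
print(f"mean arbitrariness: {arbitrariness(ensemble):.3f}")
```

Tracking these two numbers across model versions would flag updates that improve aggregate accuracy while silently reshuffling which individual patients are flagged.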

Abstract

Artificial Intelligence and Machine Learning (AI/ML) models are increasingly deployed in clinical settings to support clinical decision-making. However, when training data become stale due to changes in demographics, environment, or patient behaviors, model performance can degrade substantially. While updating models with new training data is necessary, such updates may also introduce new risks. We evaluate the proposed monitoring framework on four publicly available U.S.-based Type 1 Diabetes datasets containing high-resolution continuous glucose monitoring (CGM) data, comprising approximately 11,300 weekly observations from 496 participants under 20 years of age. All datasets included structured sociodemographic information. Using the prediction of severe hyperglycemia events in children with type 1 diabetes as a case study, we examine how different model update strategies can adversely affect model stability (e.g., by causing predictions to "flip" for a large number of cases after an update), increase arbitrariness in predictions, or worsen accuracy equity and the balance of error rates across subpopulations. We propose multiple dimensions for continuous monitoring to detect these issues and argue that such monitoring is essential for the development of trustworthy clinical decision support systems.
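
The fairness dimensions named in the abstract, accuracy equity and error-rate balance, can likewise be monitored with a simple per-subgroup comparison before and after an update. The sketch below assumes a tabular layout with `label`, `pred`, and a sociodemographic grouping column; the column names and gap definition are assumptions for illustration, not the paper's code.

```python
# Minimal sketch: per-subgroup accuracy and error rates, and the spread
# (max minus min) across subgroups as a fairness gap. DataFrame layout
# and column names are assumed for illustration.
import pandas as pd

def subgroup_error_rates(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Per-subgroup accuracy, false-positive rate, and false-negative rate."""
    rows = []
    for group, g in df.groupby(group_col):
        tp = ((g["pred"] == 1) & (g["label"] == 1)).sum()
        tn = ((g["pred"] == 0) & (g["label"] == 0)).sum()
        fp = ((g["pred"] == 1) & (g["label"] == 0)).sum()
        fn = ((g["pred"] == 0) & (g["label"] == 1)).sum()
        rows.append({
            "group": group,
            "accuracy": (tp + tn) / len(g),
            "fpr": fp / max(fp + tn, 1),
            "fnr": fn / max(fn + tp, 1),
        })
    return pd.DataFrame(rows).set_index("group")

def fairness_gaps(rates: pd.DataFrame) -> pd.Series:
    """Max-minus-min gap per metric; a gap that grows after an update is a red flag."""
    return rates.max() - rates.min()

# Toy usage: compare gaps for the pre- and post-update model.
# df_old / df_new would hold columns: label, pred, and e.g. "race_ethnicity" (assumed name).
# gaps_old = fairness_gaps(subgroup_error_rates(df_old, "race_ethnicity"))
# gaps_new = fairness_gaps(subgroup_error_rates(df_new, "race_ethnicity"))
# print((gaps_new - gaps_old).rename("gap change after update"))
```

Comparing the gap vector before and after an update gives a compact signal for whether the new model widened accuracy or error-rate disparities across subpopulations, even when overall accuracy improved.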