Mining Electronic Health Records to Investigate Effectiveness of Ensemble Deep Clustering

arXiv cs.LG / 4/9/2026


Key Points

  • The study evaluates how well traditional (e.g., K-means), hybrid, and deep learning clustering methods work on EHR-derived patient representations, using real heart failure data from the All of Us Research Program.
  • It finds that traditional clustering performs more robustly than deep clustering methods designed for image-like tasks, highlighting a domain mismatch between image clustering and tabular EHR embeddings.
  • To improve deep clustering, the authors propose an ensemble-based deep clustering method that aggregates cluster assignments across multiple embedding dimensions instead of relying on a single embedding space.
  • When combined with traditional clustering in a new ensemble framework, the proposed ensemble embedding achieves the best overall performance ranking across 14 clustering methods and multiple patient cohorts.
  • The paper stresses the importance of clustering EHR data separately by biological sex, and argues for combining traditional and deep clustering rather than relying on a single method.

Abstract

In electronic health records (EHRs), clustering patients and distinguishing disease subtypes are key tasks to elucidate pathophysiology and aid clinical decision-making. However, clustering in healthcare informatics is still based on traditional methods, especially K-means, and has achieved limited success when applied to embedding representations learned by autoencoders as hybrid methods. This paper investigates the effectiveness of traditional, hybrid, and deep learning methods in heart failure patient cohorts using real EHR data from the All of Us Research Program. Traditional clustering methods perform robustly because deep learning approaches are specifically designed for image clustering, a task that differs substantially from the tabular EHR data setting. To address the shortcomings of deep clustering, we introduce an ensemble-based deep clustering approach that aggregates cluster assignments obtained from multiple embedding dimensions, rather than relying on a single fixed embedding space. When combined with traditional clustering in a novel ensemble framework, the proposed ensemble embedding for deep clustering delivers the best overall performance ranking across 14 diverse clustering methods and multiple patient cohorts. This paper underscores the importance of biological sex-specific clustering of EHR data and the advantages of combining traditional and deep clustering approaches over a single method.
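The aggregation idea in the abstract, pooling cluster assignments from several embedding dimensions instead of trusting one, can be illustrated with a small consensus step. The sketch below is not the authors' method: the abstract does not say how assignments are aggregated, so this assumes a co-association (evidence accumulation) scheme in which two patients end up in the same consensus cluster if they share a cluster in more than half of the labelings. The function names, the 0.5 threshold, and the toy labelings (standing in for clusterings of 8-, 16-, and 32-dimensional embeddings) are all hypothetical.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of labelings in which each pair of patients shares a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(labelings)
    return [[c / runs for c in row] for row in counts]


def consensus_clusters(labelings, threshold=0.5):
    """Assumed aggregation rule: link two patients when they are co-clustered
    in more than `threshold` of the labelings, then return the connected
    components of that graph as consensus cluster ids."""
    coassoc = co_association(labelings)
    n = len(coassoc)
    parent = list(range(n))  # union-find forest over patients

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if coassoc[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel component roots as consecutive ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Hypothetical cluster assignments for six patients, one list per embedding size.
labels_dim8 = [0, 0, 0, 1, 1, 1]
labels_dim16 = [1, 1, 0, 0, 0, 0]   # patient 2 grouped differently
labels_dim32 = [0, 0, 0, 2, 2, 1]   # patient 5 split off

consensus = consensus_clusters([labels_dim8, labels_dim16, labels_dim32])
print(consensus)
```

In this toy case each individual labeling disagrees with the others about one patient, yet the majority vote over pairs recovers two stable groups. Cluster ids are arbitrary per labeling, which is why the sketch compares pairs of patients rather than raw label values.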