Scalable Posterior Uncertainty for Flexible Density-Based Clustering

arXiv stat.ML / 4/20/2026


Key Points

  • The paper proposes a new clustering uncertainty-quantification framework that treats clusters as functionals of the data-generating density rather than as latent mixture components.
  • It builds martingale posterior samples using a predictive resampling scheme driven by model score evaluations, enabling uncertainty estimates for clustering without relying on parametric density assumptions.
  • The method leverages differentiable density estimators, such as normalizing flows, so that density resampling is efficient in large-scale settings and fully parallelizable on GPUs.
  • By clustering each sampled density draw, the approach yields posterior samples of the clustering structure, supporting principled inference over clustering-related quantities.
  • Experiments on image data and single-cell RNA-seq data demonstrate the computational efficiency gained from GPU compatibility and the ability to recover meaningful clustering structures, with associated uncertainty, across diverse domains.
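
To make the score-driven predictive resampling idea concrete, here is a minimal NumPy sketch. It is not the paper's implementation: as a stand-in for a normalizing flow it assumes a toy one-dimensional, equal-weight, two-component Gaussian mixture whose only parameters are the component means, and the step size 1/(n + i), the helper names, and the fitted values in `mu_hat` are all illustrative assumptions. Each pass simulates future observations from the current density and nudges the parameters along the score, which has zero mean under the current model, so the parameter path is a martingale.

```python
import numpy as np

def responsibilities(x, mu, sigma=1.0):
    # Probability that scalar x came from each equal-weight component.
    logp = -0.5 * ((x - mu) / sigma) ** 2
    w = np.exp(logp - logp.max())
    return w / w.sum()

def score(x, mu, sigma=1.0):
    # Gradient of log p(x; mu) with respect to the component means.
    return responsibilities(x, mu, sigma) * (x - mu) / sigma**2

def predictive_resample(mu_hat, n_obs, n_future, rng, sigma=1.0):
    # One martingale posterior draw: simulate future data from the
    # current density and update the parameters with a decaying step.
    mu = mu_hat.copy()
    for i in range(1, n_future + 1):
        k = rng.integers(len(mu))             # component (equal weights)
        x_new = rng.normal(mu[k], sigma)      # sample from current density
        mu += score(x_new, mu, sigma) / (n_obs + i)  # assumed step size
    return mu

rng = np.random.default_rng(0)
n_obs = 200                       # assumed size of the observed sample
mu_hat = np.array([-2.0, 2.0])    # assumed point estimate fitted to the data

# Independent chains; on a GPU these could be batched rather than looped.
draws = np.array([predictive_resample(mu_hat, n_obs, 500, rng)
                  for _ in range(50)])
print(draws.mean(axis=0), draws.std(axis=0))
```

Each row of `draws` defines one sampled density. The spread across rows reflects uncertainty about the density given `n_obs` observations, and it shrinks as `n_obs` grows because the step sizes start smaller.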

Abstract

We introduce a novel framework for uncertainty quantification in clustering that combines martingale posterior distributions with density-based clustering. Unlike classical model-based approaches, which define clusters at the latent level of a mixture model, we treat clusters as explicit functionals of the data-generating density, without assuming any specific parametric form. To characterize density uncertainty, we obtain martingale posterior samples via a predictive resampling scheme driven by model score evaluations. This allows us to leverage state-of-the-art differentiable density estimators, such as normalizing flows, making density resampling efficient in large-scale settings and fully parallelizable on modern GPU hardware. Martingale posterior samples of the clustering structure are then obtained by applying density-based clustering to the density draws, enabling principled inference on any clustering-related quantity. Casting the inference target as a density functional further enables a rigorous theoretical analysis of the procedure's convergence properties. We apply our methodology to image and single-cell RNA sequencing data, demonstrating the computational efficiency afforded by its GPU compatibility as well as its ability to recover meaningful clustering structures, with associated uncertainty, across diverse domains.
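
The step from density draws to posterior samples of the clustering can be illustrated with a short sketch. It assumes a toy setting that is not taken from the paper: one-dimensional data, mode-based clustering in which the local minima of the density separate the clusters, and density draws faked by perturbing the means of an equal-weight Gaussian mixture (a hypothetical placeholder for real martingale posterior draws). The function names, query points, and perturbation scale are illustrative.

```python
import numpy as np

def mixture_density(grid, mu, sigma=1.0):
    # Equal-weight Gaussian mixture evaluated on a 1D grid.
    z = (grid[:, None] - mu[None, :]) / sigma
    norm = len(mu) * sigma * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm

def modal_labels(points, grid, dens):
    # Interior local minima of the density split the line into basins
    # of attraction; each point is labelled by the basin it falls in.
    is_min = (dens[1:-1] < dens[:-2]) & (dens[1:-1] <= dens[2:])
    valleys = grid[1:-1][is_min]
    return np.searchsorted(valleys, points), len(valleys) + 1

rng = np.random.default_rng(1)
points = np.array([-2.5, -1.5, 0.1, 1.5, 2.5])   # observations to cluster
grid = np.linspace(-8.0, 8.0, 1601)

# Placeholder density draws: perturbed component means (assumption).
mu_draws = np.array([-2.0, 2.0]) + 0.3 * rng.standard_normal((200, 2))

labels, n_modes = [], []
for mu in mu_draws:
    lab, k = modal_labels(points, grid, mixture_density(grid, mu))
    labels.append(lab)
    n_modes.append(k)
labels = np.array(labels)

# Fraction of draws in which each pair of points shares a cluster.
co_cluster = (labels[:, :, None] == labels[:, None, :]).mean(axis=0)
print(np.round(co_cluster, 2))
print(np.bincount(n_modes))   # distribution of the number of clusters
```

Because the co-clustering matrix only asks whether two points share a label, it does not depend on how clusters are numbered in each draw. In this toy setup the point near 0.1 sits close to the valley between the two modes, so its cluster membership varies from draw to draw while the outer points stay fixed.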