Turtle shell clustering: A mixture approach to discriminative clustering with applications to flow cytometry and other data

arXiv stat.ML / 4/28/2026


Key Points

  • The paper introduces “turtle shell clustering,” a fully unsupervised probabilistic method that combines geometric (generative) and boundary-focused (discriminative) ideas via a regularized mutual information objective.
  • It models the conditional distribution using a “mixture of mixtures” consisting of Gaussian components and uniform distributions, helping the method handle noise and irregular cluster shapes.
  • The approach includes automatic selection of the number of components using a regularization term plus a merge step, drawing inspiration from reversible-jump MCMC techniques for Bayesian clustering.
  • Experiments on simulated and real datasets, including flow cytometry data, test the method's ability to estimate non-linear decision boundaries and recover intuitive clusters despite noise and irregular cluster shapes.
  • Taken together, the method targets three goals without supervision: non-linear cluster boundaries, automatic selection of the number of components, and robustness to noise and irregular cluster shapes.
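
To make the Gaussian-plus-uniform idea from the key points concrete, here is a minimal sketch of how a uniform component can absorb stray points that no Gaussian explains well. It is an illustration under stated assumptions, not the paper's model: the isotropic Gaussians, the axis-aligned box for the uniform component, the fixed weights, and the function names `gaussian_pdf`, `uniform_pdf`, and `responsibilities` are all hypothetical choices made for this example.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Isotropic Gaussian density at point x (a sequence of floats)."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    norm = (2 * math.pi * sigma ** 2) ** (d / 2)
    return math.exp(-sq_dist / (2 * sigma ** 2)) / norm

def uniform_pdf(x, low, high):
    """Uniform density on the axis-aligned box [low, high]; zero outside it."""
    if any(xi < lo or xi > hi for xi, lo, hi in zip(x, low, high)):
        return 0.0
    return 1.0 / math.prod(hi - lo for lo, hi in zip(low, high))

def responsibilities(x, weights, densities):
    """Posterior probability of each mixture component given point x."""
    weighted = [w * f(x) for w, f in zip(weights, densities)]
    total = sum(weighted)
    return [v / total for v in weighted]

# Assumed toy setup: two Gaussian components plus one uniform "noise" component.
densities = [
    lambda x: gaussian_pdf(x, (0.0, 0.0), 1.0),
    lambda x: gaussian_pdf(x, (5.0, 5.0), 1.0),
    lambda x: uniform_pdf(x, (-10.0, -10.0), (10.0, 10.0)),
]
weights = [0.45, 0.45, 0.10]

for point in [(0.0, 0.0), (5.0, 5.0), (-8.0, 8.0)]:
    r = responsibilities(point, weights, densities)
    best = max(range(len(r)), key=r.__getitem__)
    print(point, "-> component", best)
```

In this toy setup the two points at the Gaussian centres are assigned to their own Gaussian, while the distant point at (-8, 8) goes to the uniform component, because both Gaussian densities are vanishingly small there.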

Abstract

Generative approaches to clustering provide information on geometric properties of clusters, whereas discriminative approaches provide boundaries between clusters. Ideas from both approaches are incorporated to present a fully unsupervised, probabilistic, and discriminative clustering method via a regularized mutual information objective function, wherein a mixture of mixtures of Gaussian and uniform distributions is used for formulation of the conditional model. Automatic selection of the number of components is established with the introduction of the regularizing term and a merge step, similar to those applied in reversible jump Markov chain Monte Carlo methods used in Bayesian clustering. Consequently, the turtle shell method -- a fully unsupervised clustering method capable of estimating non-linear boundary lines, automatically selecting the number of components, and capturing intuitive clusters in the presence of data abnormalities such as noise and/or irregular cluster shapes -- is introduced. We test this method on various simulated and real datasets commonly explored in clustering research, and extend the analysis to datasets arising from flow cytometry experiments.
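
The abstract does not spell out the objective. As a rough sketch, the mutual information term that discriminative clustering objectives of this kind typically maximize can be estimated from the conditional cluster probabilities p(k | x_i): take the entropy of the average assignment and subtract the average entropy of the individual assignments. The code below assumes that standard empirical estimator and leaves out the paper's regularizing term and merge step, whose forms are not given here. The function names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(v * math.log(v) for v in p if v > 0)

def mutual_information(posteriors):
    """Empirical I(X; K) from rows of conditional probabilities p(k | x_i).

    Computed as H(mean of rows) minus the mean of H(row): high when points
    are assigned confidently and clusters are used in a balanced way.
    """
    n = len(posteriors)
    k = len(posteriors[0])
    marginal = [sum(row[j] for row in posteriors) / n for j in range(k)]
    conditional = sum(entropy(row) for row in posteriors) / n
    return entropy(marginal) - conditional

# Confident, balanced assignments give the maximum for two clusters, log(2).
confident = [[1.0, 0.0], [0.0, 1.0]]
# Completely uncertain assignments carry no information about the cluster.
uncertain = [[0.5, 0.5], [0.5, 0.5]]

print(mutual_information(confident))
print(mutual_information(uncertain))
```

Maximizing this quantity alone rewards confident boundaries and balanced cluster usage. A regularizer is usually needed on top of it to keep the solution from becoming degenerate, and that is the role the abstract gives to its regularizing term.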