A Novel Theoretical Analysis for Clustering Heteroscedastic Gaussian Data without Knowledge of the Number of Clusters

arXiv stat.ML / 4/3/2026


Key Points

  • The paper studies clustering when measurements are heteroscedastic, assuming each cluster’s data are Gaussian with potentially different and unknown covariance matrices around a centroid.
  • It introduces a new centroid cost function whose gradient zeros are the fixed-points of a certain function, generalizing the derivation of Mean-Shift, and it proves that, when cluster sizes and centroid separations are sufficiently large, these fixed-points correspond to the true cluster centroids.
  • A new “Wald kernel” is proposed, defined via the p-value of a Wald hypothesis test for Gaussian means, aimed at measuring cluster membership plausibility while scaling better with feature dimension than a standard Gaussian kernel.
  • Using this theoretical framework, the authors derive the CENTRE-X clustering algorithm, which (like Mean-Shift) does not require the number of clusters and uses the Wald test to reduce the number of candidate fixed-points to compute, lowering computational cost.
  • Simulations on synthetic and real datasets indicate CENTRE-X achieves comparable or better clustering performance than K-means and Mean-Shift even when covariance information is imperfect or unknown.
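The Wald kernel described above has a natural concrete form: for a Gaussian measurement with known covariance, the Wald statistic for testing whether the mean equals a candidate centroid is chi-squared distributed, and the kernel value is the test's p-value. The sketch below (an illustration under that assumption; the function name and exact conventions are not from the paper) shows how such a kernel could be computed:

```python
import numpy as np
from scipy.stats import chi2

def wald_kernel(x, c, cov):
    """Plausibility that measurement x belongs to the cluster centred at c.

    Computes the Wald statistic T = (x - c)^T cov^{-1} (x - c), which is
    chi-squared with d degrees of freedom under H0: E[x] = c, and returns
    its p-value. A large p-value means x is plausibly drawn from a Gaussian
    centred at c with covariance cov.
    """
    d = x.shape[0]
    diff = x - c
    t = diff @ np.linalg.solve(cov, diff)  # Wald statistic
    return chi2.sf(t, df=d)                # survival function = p-value
```

Because the statistic is reduced to a scalar with a fixed reference distribution, the p-value stays on a comparable scale as the feature dimension grows, which is the intuition behind the claim that it scales better than a raw Gaussian kernel.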

Abstract

This paper addresses the problem of clustering measurement vectors that are heteroscedastic in that they can have different covariance matrices. From the assumption that the measurement vectors within a given cluster are Gaussian distributed, with possibly different and unknown covariance matrices, around the cluster centroid, we introduce a novel cost function to estimate the centroids. The zeros of the gradient of this cost function turn out to be the fixed-points of a certain function. As such, the approach generalizes the methodology employed to derive the existing Mean-Shift algorithm. As the main novel theoretical result compared to Mean-Shift, this paper shows that the sole fixed-points of the identified function tend to be the cluster centroids if both the number of measurements per cluster and the distances between centroids are large enough. As a second contribution, this paper introduces the Wald kernel for clustering. This kernel is defined as the p-value of the Wald hypothesis test for testing the mean of a Gaussian. As such, the Wald kernel measures the plausibility that a measurement vector belongs to a given cluster, and it scales better with the dimension of the measurement vectors than the usual Gaussian kernel. Finally, the proposed theoretical framework allows us to derive a new clustering algorithm called CENTRE-X that works by estimating the fixed-points of the identified function. Like Mean-Shift, CENTRE-X requires no prior knowledge of the number of clusters. It relies on a Wald hypothesis test to significantly reduce the number of fixed-points to calculate compared to the Mean-Shift algorithm, resulting in a clear reduction in complexity. Simulation results on synthetic and real data sets show that CENTRE-X has comparable or better performance than the standard clustering algorithms K-means and Mean-Shift, even when the covariance matrices are not perfectly known.
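The abstract does not spell out the CENTRE-X update, but the fixed-point idea it generalizes is familiar from Mean-Shift: iterate a kernel-weighted mean until it stops moving. As an illustrative sketch only, the following pairs Wald-test p-values (used here as weights; the helper names and the specific weighted-mean update are assumptions, not the paper's algorithm) with such an iteration:

```python
import numpy as np
from scipy.stats import chi2

def wald_weight(x, c, cov):
    # p-value of the Wald test for H0: E[x] = c (chi-squared, d dof)
    d = x.shape[0]
    diff = x - c
    t = diff @ np.linalg.solve(cov, diff)
    return chi2.sf(t, df=d)

def fixed_point_centroid(X, covs, c0, n_iter=100, tol=1e-8):
    """Mean-Shift-style iteration: move c to the weighted mean of the data,
    with each point weighted by its Wald-test plausibility, until c is a
    fixed-point of the update (illustrative sketch, not the paper's method)."""
    c = c0.copy()
    for _ in range(n_iter):
        w = np.array([wald_weight(x, c, S) for x, S in zip(X, covs)])
        if w.sum() == 0.0:          # no point is plausibly in this cluster
            break
        c_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(c_new - c) < tol:
            return c_new            # converged to a fixed-point
        c = c_new
    return c
```

Points whose Wald test strongly rejects the candidate centroid receive near-zero weight, which is also the mechanism by which a preliminary Wald test can prune candidate fixed-points and cut computation relative to plain Mean-Shift.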