Fast estimation of Gaussian mixture components via centering and singular value thresholding

arXiv stat.ML / 4/22/2026

Key Points

The paper tackles the problem of estimating the number of components in Gaussian mixture models, an unsupervised learning task that is hardest when the data are high-dimensional, the components are many, or the component sizes are severely imbalanced.
It introduces a non-iterative estimator that centers the data, computes singular values of the centered matrix, and counts singular values above a chosen threshold.
The authors provide a theoretical guarantee: with a mild separation condition on component centers, the estimator consistently recovers the true number of components.
The guarantee covers extreme regimes: the dimension can be much larger than the sample size, and the number of components can grow as large as the smaller of the dimension and the sample size, even under severe imbalance among component sizes.
Empirically, the approach is accurate in difficult settings and extremely fast: the authors report processing ten million samples in one hundred dimensions within one minute.
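
The three steps in the key points (center, compute singular values, count those above a threshold) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The summary does not give the paper's threshold, so the default used here is an assumption: for unit-variance isotropic noise, the largest singular value of an n-by-d pure-noise matrix is close to sqrt(n) + sqrt(d), and the sketch adds 10% slack on top of that. The sketch also assumes the K centers are affinely independent. Centering then leaves K - 1 signal directions, so it reports the count plus one. Whether the paper applies the same offset is not stated in the summary. The names `estimate_num_components` and `simulate_mixture` are hypothetical.

```python
import numpy as np


def estimate_num_components(X, threshold=None):
    """Count large singular values of the centered data matrix.

    Assumptions of this sketch (not taken from the paper):
    - noise is isotropic with unit variance, so the default threshold
      can use the sqrt(n) + sqrt(d) scale of a pure-noise matrix;
    - the K centers are affinely independent, so the centered means
      span K - 1 directions and the estimate is (count + 1).
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    Xc = X - X.mean(axis=0, keepdims=True)        # step 1: center the data
    s = np.linalg.svd(Xc, compute_uv=False)       # step 2: singular values only
    if threshold is None:
        # Hypothetical default: noise scale with 10% slack.
        threshold = 1.1 * (np.sqrt(n) + np.sqrt(d))
    n_large = int(np.sum(s > threshold))          # step 3: count above threshold
    return n_large + 1


def simulate_mixture(sizes, d, separation=10.0, seed=0):
    """Hypothetical helper: unit-variance clusters centered at separation * e_k."""
    rng = np.random.default_rng(seed)
    K = len(sizes)
    centers = separation * np.eye(K, d)           # K x d, one axis per center
    labels = np.repeat(np.arange(K), sizes)       # imbalanced cluster labels
    return centers[labels] + rng.standard_normal((len(labels), d))


# Three clusters with a 70% / 25% / 5% split in 20 dimensions.
X = simulate_mixture(sizes=[2800, 1000, 200], d=20)
k_hat = estimate_num_components(X)
print(k_hat)
```

In this synthetic setup the two signal singular values sit well above the noise scale, so the sketch should recover three components despite the imbalance. With real data the noise level is unknown, and the threshold would need to be chosen as the paper prescribes.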

Abstract

Estimating the number of components is a fundamental challenge in unsupervised learning, particularly when dealing with high-dimensional data with many components or severely imbalanced component sizes. This paper addresses this challenge for classical Gaussian mixture models. The proposed estimator is simple: center the data, compute the singular values of the centered matrix, and count those above a threshold. No iterative fitting, no likelihood calculation, and no prior knowledge of the number of components are required. We prove that, under a mild separation condition on the component centers, the estimator consistently recovers the true number of components. The result holds in high-dimensional settings where the dimension can be much larger than the sample size. It also holds when the number of components grows to the smaller of the dimension and the sample size, even under severe imbalance among component sizes. Computationally, the method is extremely fast: for example, it processes ten million samples in one hundred dimensions within one minute. Extensive experimental studies confirm its accuracy in challenging settings such as high dimensionality, many components, and severe class imbalance.
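
The abstract does not say how the singular values are computed at the quoted scale. One plausible route, assumed here and not attributed to the authors, applies when the sample size n is much larger than the dimension d: the singular values of the centered matrix are the square roots of the eigenvalues of its d-by-d Gram matrix, which can be formed without materializing a centered copy of the data. The dominant cost is the product `X.T @ X`, roughly n·d² multiply-adds, or about 10^7 × 100² = 10^11 for ten million samples in one hundred dimensions. The function name `centered_singular_values_gram` is hypothetical.

```python
import numpy as np


def centered_singular_values_gram(X):
    """Singular values of the centered matrix via the d x d Gram matrix.

    Uses the identity Xc.T @ Xc = X.T @ X - n * outer(mean, mean),
    where Xc is X with its column means subtracted.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    mean = X.mean(axis=0)
    gram = X.T @ X - n * np.outer(mean, mean)     # d x d, cost about n * d^2
    evals = np.linalg.eigvalsh(gram)              # ascending eigenvalues
    # Clip tiny negative values caused by rounding before the square root.
    return np.sqrt(np.clip(evals[::-1], 0.0, None))


rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8)) + 3.0
fast = centered_singular_values_gram(X)
reference = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(np.allclose(fast, reference, rtol=1e-6))
```

Working with the Gram matrix squares the condition number, so very small singular values lose precision. That matters little when the goal is only to count the large ones, but a direct SVD is the safer choice when d is comparable to or larger than n.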