Abstract
Motivated by applications in statistics and machine learning, we consider the problem of unmixing convex combinations of nonparametric densities. Suppose we observe n groups of samples, where the ith group consists of N_i independent samples from a d-variate density f_i(x)=\sum_{k=1}^K \pi_i(k)g_k(x). Here, each g_k(x) is a nonparametric density, and each \pi_i is a K-dimensional mixed membership vector. We aim to estimate g_1(x), \ldots, g_K(x). This problem generalizes topic modeling from discrete to continuous variables and finds applications in, for example, topic modeling of word embeddings produced by large language models (LLMs).
In this paper, we propose an estimator for this problem that modifies the classical kernel density estimator by assigning group-specific weights, which are computed via topic modeling on histogram vectors and debiased using U-statistics. For any \beta>0, assuming that each g_k(x) belongs to the Nikol'skii class with smoothness parameter \beta, we show that the sum of integrated squared errors of the proposed estimators attains a convergence rate depending on n, K, d, and the per-group sample size N. We also establish a matching lower bound, showing that our estimator is rate-optimal.