On the Optimal Number of Grids for Differentially Private Non-Interactive $K$-Means Clustering

arXiv stat.ML / 3/31/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

Key Points

  • The paper studies how to choose the optimal number of grids when doing differentially private, non-interactive $K$-means via privatized histograms and discretization.
  • It argues that grid size strongly affects the trade-off between quantization bias and the noise needed to satisfy differential privacy.
  • Rather than using grid sizes independent of $K$ and relying on empirical tuning, the authors derive a new grid-size selection rule by minimizing an upper bound on the expected deviation of the $K$-means objective.
  • The proposed discretization strategy changes how grid resolution scales with the number of clusters, dataset size, and the privacy budget.
  • Extensive experiments show improved clustering accuracy over state-of-the-art methods, including scenarios with tight privacy budgets.

Abstract

Differentially private K-means clustering enables releasing cluster centers derived from a dataset while protecting the privacy of the individuals. Non-interactive clustering techniques based on privatized histograms are attractive because the released data synopsis can be reused for other downstream tasks without additional privacy loss. The choice of the number of grids for discretizing the data points is crucial, as it directly controls the quantization bias and the amount of noise injected to preserve privacy. The widely adopted strategy selects a grid size that is independent of the number of clusters and also relies on empirical tuning. In this work, we revisit this choice and propose a refined grid-size selection rule derived by minimizing an upper bound on the expected deviation in the K-means objective function, leading to a more principled discretization strategy for non-interactive private clustering. Compared to prior work, our grid resolution differs both in its dependence on the number of clusters and in the scaling with dataset size and privacy budget. Extensive numerical results elucidate that the proposed strategy results in accurate clustering compared to the state-of-the-art techniques, even under tight privacy budgets.