Weighted quantization using MMD: From mean field to mean shift via gradient flows

arXiv stat.ML / 4/3/2026


Key Points

  • The paper studies how to approximate a target probability distribution with a weighted set of Dirac particles (quantization) using maximum mean discrepancy (MMD) rather than the more common Wasserstein metric.
  • It argues that a Wasserstein-Fisher-Rao gradient flow provides a principled way to design MMD-optimal quantizations, and it shows how an interacting-particle ODE system can discretize this flow.
  • Building on this framework, the authors derive a new fixed-point method called mean shift interacting particles (MSIP) and show that it extends classical mean shift used for mode finding in kernel density estimation.
  • The work interprets MSIP both as a form of preconditioned gradient descent and as a relaxation of Lloyd’s algorithm for clustering, connecting multiple established ideas under one theory.
  • Numerical experiments in high-dimensional, multi-modal settings suggest the resulting algorithms are more robust than existing state-of-the-art approaches.
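The objective underlying these points, the squared MMD between a weighted mixture of Diracs and an empirical target, can be written out directly from the kernel. As a minimal sketch (not the paper's implementation; the Gaussian kernel and its bandwidth `bw` are illustrative choices):

```python
import numpy as np

def gaussian_kernel(X, Y, bw=1.0):
    # Pairwise Gaussian kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 bw^2)).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw**2))

def mmd_squared(particles, weights, samples, bw=1.0):
    # Squared MMD between the weighted Dirac mixture sum_i w_i delta_{x_i}
    # and the empirical measure (1/M) sum_m delta_{y_m} of `samples`:
    #   sum_ij w_i w_j k(x_i, x_j)
    #   - (2/M) sum_im w_i k(x_i, y_m)
    #   + (1/M^2) sum_mn k(y_m, y_n).
    Kxx = gaussian_kernel(particles, particles, bw)
    Kxy = gaussian_kernel(particles, samples, bw)
    Kyy = gaussian_kernel(samples, samples, bw)
    return (weights @ Kxx @ weights
            - 2.0 * weights @ Kxy.mean(axis=1)
            + Kyy.mean())
```

Quantization in this setting means choosing both the particle locations and the weights to drive this quantity down; the Wasserstein-Fisher-Rao flow in the paper evolves both simultaneously.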

Abstract

Approximating a probability distribution using a set of particles is a fundamental problem in machine learning and statistics, with applications including clustering and quantization. Formally, we seek a weighted mixture of Dirac measures that best approximates the target distribution. While much existing work relies on the Wasserstein distance to quantify approximation errors, maximum mean discrepancy (MMD) has received comparatively less attention, especially when allowing for variable particle weights. We argue that a Wasserstein-Fisher-Rao gradient flow is well-suited for designing quantizations optimal under MMD. We show that a system of interacting particles satisfying a set of ODEs discretizes this flow. We further derive a new fixed-point algorithm called mean shift interacting particles (MSIP). We show that MSIP extends the classical mean shift algorithm, widely used for identifying modes in kernel density estimators. Moreover, we show that MSIP can be interpreted as preconditioned gradient descent and that it acts as a relaxation of Lloyd's algorithm for clustering. Our unification of gradient flows, mean shift, and MMD-optimal quantization yields algorithms that are more robust than state-of-the-art methods, as demonstrated via high-dimensional and multi-modal numerical experiments.
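For context on the classical algorithm that MSIP extends: mean shift iterates a fixed-point map that moves a point toward a local mode of a kernel density estimate. A minimal sketch of that standard update (Gaussian kernel; function names and the bandwidth `bw` are illustrative, and this is the single-particle classic, not the paper's interacting-particle MSIP):

```python
import numpy as np

def mean_shift_step(x, samples, bw=1.0):
    # One fixed-point update of classical mean shift:
    #   x <- sum_m k(x, y_m) y_m / sum_m k(x, y_m),
    # i.e. a kernel-weighted average of the samples around x.
    d2 = ((samples - x) ** 2).sum(axis=1)
    k = np.exp(-d2 / (2 * bw**2))
    return (k[:, None] * samples).sum(axis=0) / k.sum()

def mean_shift(x0, samples, bw=1.0, n_iter=100, tol=1e-8):
    # Iterate until the update stalls; the limit is a mode of the
    # kernel density estimate built from `samples`.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x_new = mean_shift_step(x, samples, bw)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```

The paper's MSIP replaces this single-point map with a coupled update over all particles and their weights, which is what yields the preconditioned-gradient-descent and relaxed-Lloyd interpretations.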