Scalable Model-Based Clustering with Sequential Monte Carlo

arXiv stat.ML / 4/17/2026


Key Points

  • The paper addresses online clustering under uncertainty, where cluster assignments often cannot be resolved until more data are observed.
  • It proposes a new Sequential Monte Carlo (SMC) algorithm that decomposes the clustering problem into approximately independent subproblems, which allows a more compact representation of the algorithm state and eases the memory bottleneck of standard SMC.
  • The method targets clusters that follow complex distributions, as is the case with text data.
  • The authors motivate the approach using the knowledge base construction problem and report that it can solve clustering tasks accurately and efficiently in settings where traditional SMC methods struggle.

Abstract

In online clustering problems, there is often a large amount of uncertainty over possible cluster assignments that cannot be resolved until more data are observed. This difficulty is compounded when clusters follow complex distributions, as is the case with text data. Sequential Monte Carlo (SMC) methods give a natural way of representing and updating this uncertainty over time, but have prohibitive memory requirements for large-scale problems. We propose a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state. Our approach is motivated by the knowledge base construction problem, and we show that our method is able to accurately and efficiently solve clustering problems in this setting and others where traditional SMC struggles.
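
For intuition about the memory cost mentioned in the abstract, here is a minimal sketch of a conventional particle filter for online clustering. It is not the paper's algorithm and does not implement the proposed decomposition. Everything specific in it is an illustrative assumption: the 1-D Gaussian clusters with known noise variance, the Chinese restaurant process prior with concentration `alpha`, the hyperparameter values, and the function names `predictive_logpdf` and `smc_cluster`.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def predictive_logpdf(x, n, s, prior_var=10.0, noise_var=1.0):
    """Log posterior predictive of x for a cluster holding n points with sum s.

    Assumes a zero-mean normal prior on the cluster mean (illustrative choice).
    """
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (s / noise_var)
    return log_normal_pdf(x, post_mean, post_var + noise_var)


def smc_cluster(data, num_particles=100, alpha=1.0, seed=0):
    """Generic particle filter over cluster assignments (assumed toy model).

    Each particle stores a full assignment vector plus per-cluster
    [count, sum] statistics, so the state grows with every observation.
    """
    rng = random.Random(seed)
    particles = [([], []) for _ in range(num_particles)]
    for t, x in enumerate(data):
        proposed, log_weights = [], []
        for assignments, stats in particles:
            # Unnormalized log posterior over existing clusters and a new one.
            logps = [math.log(n) + predictive_logpdf(x, n, s) for n, s in stats]
            logps.append(math.log(alpha) + predictive_logpdf(x, 0, 0.0))
            m = max(logps)
            probs = [math.exp(lp - m) for lp in logps]
            # Incremental weight: predictive likelihood of x under this particle.
            log_weights.append(m + math.log(sum(probs)) - math.log(t + alpha))
            # Sample the assignment from the locally optimal proposal.
            k = rng.choices(range(len(probs)), weights=probs)[0]
            new_stats = [list(c) for c in stats]  # copy, particles may be shared
            if k == len(stats):
                new_stats.append([1, x])
            else:
                new_stats[k][0] += 1
                new_stats[k][1] += x
            proposed.append((assignments + [k], new_stats))
        # Multinomial resampling at every step keeps the weights uniform.
        m = max(log_weights)
        weights = [math.exp(lw - m) for lw in log_weights]
        particles = rng.choices(proposed, weights=weights, k=num_particles)
    return particles


data = [-5.1, -4.9, 5.0, 5.2, -5.0, 4.8]
particles = smc_cluster(data, num_particles=50, seed=1)
# Number of labels held in memory: num_particles * len(data).
stored_labels = sum(len(assignments) for assignments, _ in particles)
```

In this sketch the filter keeps `num_particles * len(data)` labels in total, one complete clustering hypothesis per particle. That growth is one way to see why a more compact representation of the algorithm state matters for large-scale problems.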