Kempe Swap K-Means: A Scalable Near-Optimal Solution for Semi-Supervised Clustering

arXiv cs.LG / 3/31/2026


Key Points

  • The paper introduces Kempe Swap K-Means, a centroid-based heuristic for constrained clustering that supports rigid must-link and cannot-link requirements.
  • It uses a two-phase iterative approach: an assignment refinement step via Kempe chain swaps followed by a centroid update step that computes optimal centroids given the current assignments.
  • To reduce the risk of getting stuck in poor local optima, the method adds controlled perturbations during the centroid update phase, enabling a more global search.
  • Experiments on large-scale datasets show the algorithm achieves near-optimal partitions while remaining computationally efficient and scalable.
  • Reported results indicate that Kempe Swap K-Means outperforms existing state-of-the-art benchmarks in both clustering accuracy and algorithmic efficiency on large-scale datasets.
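
The assignment step in the key points above relies on Kempe chain swaps, a move borrowed from graph coloring. The summary does not spell out the paper's exact procedure, so the following is only a minimal sketch of the general idea under stated assumptions: clusters play the role of colors, cannot-link constraints form a conflict graph, and must-link groups are assumed to have been merged into single nodes beforehand. The names `kempe_chain`, `kempe_swap`, `is_feasible`, and `cl_adj` are hypothetical and do not come from the paper.

```python
from collections import deque

def kempe_chain(start, labels, cl_adj, a, b):
    """Connected component containing `start` in the cannot-link graph
    restricted to nodes currently labelled a or b."""
    chain = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in cl_adj.get(u, ()):
            if v not in chain and labels[v] in (a, b):
                chain.add(v)
                queue.append(v)
    return chain

def kempe_swap(start, target, labels, cl_adj):
    """Move `start` to cluster `target` by exchanging the two labels along
    its Kempe chain. If the input labelling satisfies every cannot-link
    constraint, the returned labelling does too."""
    a, b = labels[start], target
    if a == b:
        return list(labels)
    chain = kempe_chain(start, labels, cl_adj, a, b)
    new_labels = list(labels)
    for u in chain:
        new_labels[u] = b if labels[u] == a else a
    return new_labels

def is_feasible(labels, cl_adj):
    """True if no cannot-link pair shares a cluster."""
    return all(labels[u] != labels[v] for u, nbrs in cl_adj.items() for v in nbrs)

# Toy instance (assumed data): cannot-link pairs (0, 1) and (1, 2).
cl_adj = {0: [1], 1: [0, 2], 2: [1]}
labels = [0, 1, 0, 1]
swapped = kempe_swap(0, 1, labels, cl_adj)
print(swapped, is_feasible(swapped, cl_adj))
```

In a full algorithm, a swap like this would presumably be accepted only when it lowers the clustering objective; that acceptance rule is not described in this summary and is left out of the sketch.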

Abstract

This paper presents a novel centroid-based heuristic algorithm, termed Kempe Swap K-Means, for constrained clustering under rigid must-link (ML) and cannot-link (CL) constraints. The algorithm employs a dual-phase iterative process: an assignment step that utilizes Kempe chain swaps to refine the current clustering in the constrained solution space, and a centroid update step that computes optimal cluster centroids. To enhance global search capabilities and avoid local optima, the framework incorporates controlled perturbations during the update phase. Empirical evaluations demonstrate that the proposed method achieves near-optimal partitions while maintaining high computational efficiency and scalability. The results indicate that Kempe Swap K-Means consistently outperforms state-of-the-art benchmarks in both clustering accuracy and algorithmic efficiency for large-scale datasets.
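
The centroid update step described in the abstract can be illustrated with a short sketch. For a fixed assignment, the centroid that minimizes the within-cluster sum of squared distances is the cluster mean, which is the standard k-means result. The abstract does not say what form the controlled perturbations take, so the Gaussian noise below, along with the names `update_centroids` and `noise_scale`, is purely an assumption for illustration and not the authors' scheme.

```python
import random

def update_centroids(points, labels, k, noise_scale=0.0, rng=None):
    """Compute each cluster's mean, optionally adding Gaussian noise.

    noise_scale is a hypothetical knob standing in for the paper's
    'controlled perturbations'; 0.0 gives the plain k-means update.
    Empty clusters are returned as None.
    """
    rng = rng or random.Random(0)
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point, cluster in zip(points, labels):
        counts[cluster] += 1
        for d in range(dim):
            sums[cluster][d] += point[d]

    centroids = []
    for c in range(k):
        if counts[c] == 0:
            centroids.append(None)
            continue
        centroid = []
        for d in range(dim):
            noise = rng.gauss(0.0, noise_scale) if noise_scale > 0 else 0.0
            centroid.append(sums[c][d] / counts[c] + noise)
        centroids.append(centroid)
    return centroids

# Toy data (assumed): two points in cluster 0, one point in cluster 1.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
exact = update_centroids(points, [0, 0, 1], k=2)
perturbed = update_centroids(points, [0, 0, 1], k=2, noise_scale=0.1)
print(exact)
```

Shrinking `noise_scale` over iterations is one common way to keep such perturbations controlled, although whether the paper does this is not stated here.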