Composite Silhouette: A Subsampling-based Aggregation Strategy

arXiv cs.LG / 4/16/2026

Key Points

  • The paper addresses unsupervised model selection for estimating the number of clusters, highlighting that the standard (micro-averaged) Silhouette coefficient can be biased toward larger clusters when cluster sizes are imbalanced.
  • It proposes “Composite Silhouette,” which aggregates information across multiple subsampled clusterings instead of relying on a single partition, aiming to reduce both size-bias and noise from small clusters.
  • For each subsample, the method adaptively combines micro- and macro-averaged Silhouette scores using a convex weight based on normalized discrepancy, smoothed by a bounded nonlinearity to control overreactions.
  • The authors establish key properties of the criterion and derive finite-sample concentration guarantees for its subsampling estimate.
  • Experiments on synthetic and real-world datasets show that Composite Silhouette better recovers the ground-truth number of clusters than standard micro or macro approaches.
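
To make the micro/macro distinction in the points above concrete, here is a minimal standard-library sketch. It is an illustration, not the authors' code: the function names, the Euclidean distance, the zero score for singleton clusters, and the toy data are all assumptions made for this example.

```python
import math
from collections import defaultdict


def silhouette_samples(points, labels):
    """Per-point silhouette s(i) = (b - a) / max(a, b), Euclidean distance.

    Assumes at least two distinct cluster labels.
    """
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def micro_macro_silhouette(points, labels):
    """Return (micro, macro): mean over points vs. mean of per-cluster means."""
    s = silhouette_samples(points, labels)
    micro = sum(s) / len(s)
    per_cluster = defaultdict(list)
    for score, lab in zip(s, labels):
        per_cluster[lab].append(score)
    macro = sum(sum(v) / len(v) for v in per_cluster.values()) / len(per_cluster)
    return micro, macro


# Toy imbalanced data (made up for illustration): six tightly packed points
# and a loose two-point cluster.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.05, 0.05), (0.05, 0),
          (3, 0), (5, 0)]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
micro, macro = micro_macro_silhouette(points, labels)
print(f"micro={micro:.3f}  macro={macro:.3f}")
```

On this toy data the six tight points dominate the micro average, while the macro average gives the loose two-point cluster half of the weight, so the two scores diverge.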

Abstract

Determining the number of clusters is a central challenge in unsupervised learning, where ground-truth labels are unavailable. The Silhouette coefficient is a widely used internal validation metric for this task, yet its standard micro-averaged form tends to favor larger clusters under size imbalance. Macro-averaging mitigates this bias by weighting clusters equally, but may overemphasize noise from under-represented groups. We introduce Composite Silhouette, an internal criterion for cluster-count selection that aggregates evidence across repeated subsampled clusterings rather than relying on a single partition. For each subsample, micro- and macro-averaged Silhouette scores are combined through an adaptive convex weight determined by their normalized discrepancy and smoothed by a bounded nonlinearity; the final score is then obtained by averaging these subsample-level composites. We establish key properties of the criterion and derive finite-sample concentration guarantees for its subsampling estimate. Experiments on synthetic and real-world datasets show that Composite Silhouette effectively reconciles the strengths of micro- and macro-averaging, yielding more accurate recovery of the ground-truth number of clusters.
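
The aggregation described in the abstract can be sketched roughly as follows. The abstract does not give the exact formulas, so everything specific here is an assumption chosen for illustration: the discrepancy normalization, the use of `tanh` as the bounded nonlinearity, the `gamma` parameter, the choice to shift weight toward the macro score as the discrepancy grows, and the made-up input numbers. The sketch takes per-subsample (micro, macro) pairs as given; producing them requires clustering each subsample, which is omitted.

```python
import math


def composite_score(micro, macro, gamma=5.0, eps=1e-12):
    """Convex combination of micro and macro scores for one subsample.

    Hypothetical weighting, not the paper's formula.
    """
    # Normalized discrepancy in [0, 1].
    d = abs(micro - macro) / (abs(micro) + abs(macro) + eps)
    # Bounded nonlinearity keeps the weight in [0, 1).
    w = math.tanh(gamma * d)
    return (1.0 - w) * micro + w * macro


def composite_silhouette(pairs, gamma=5.0):
    """Average the subsample-level composites over all subsamples."""
    scores = [composite_score(mi, ma, gamma) for mi, ma in pairs]
    return sum(scores) / len(scores)


def select_k(pairs_by_k, gamma=5.0):
    """Pick the candidate cluster count with the highest aggregated score."""
    return max(pairs_by_k, key=lambda k: composite_silhouette(pairs_by_k[k], gamma))


# Made-up (micro, macro) pairs for three subsamples per candidate k.
pairs_by_k = {
    2: [(0.52, 0.40), (0.50, 0.43), (0.55, 0.41)],
    3: [(0.71, 0.68), (0.69, 0.70), (0.73, 0.66)],
    4: [(0.60, 0.35), (0.58, 0.39), (0.62, 0.30)],
}
best_k = select_k(pairs_by_k)
print("selected k:", best_k)
```

Because each subsample's weight lies in [0, 1), the composite for a subsample always falls between its micro and macro scores, and the two coincide when they agree.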