Language-Assisted Image Clustering Guided by Discriminative Relational Signals and Adaptive Semantic Centers

arXiv cs.LG / 3/26/2026


Key Points

  • The paper introduces a new Language-Assisted Image Clustering (LAIC) framework that uses vision-language models to augment images with text, improving clustering quality.
  • It targets shortcomings in prior LAIC methods, including overly similar per-image textual features that reduce inter-class discriminability.
  • The approach generates more discriminative cross-modal self-supervision signals using relational cues, allowing it to work with most VLM training mechanisms.
  • It learns category-wise, continuous semantic centers via prompt learning to guide final clustering assignments instead of relying only on fixed pre-built image-text alignments.
  • Experiments across eight benchmark datasets show an average 2.6% improvement over state-of-the-art methods, and the semantic centers are reported to be interpretable.

Abstract

Language-Assisted Image Clustering (LAIC) augments input images with additional text, generated with the help of vision-language models (VLMs), to improve clustering performance. Despite recent progress, existing LAIC methods often overlook two issues: (i) the textual features constructed for each image are highly similar, leading to weak inter-class discriminability; (ii) the clustering step is restricted to pre-built image-text alignments, limiting the potential for better utilization of the text modality. To address these issues, we propose a new LAIC framework with two complementary components. First, we exploit cross-modal relations to produce more discriminative self-supervision signals for clustering, an approach compatible with most VLM training mechanisms. Second, we learn category-wise continuous semantic centers via prompt learning to produce the final clustering assignments. Extensive experiments on eight benchmark datasets demonstrate that our method achieves an average improvement of 2.6% over state-of-the-art methods, and the learned semantic centers exhibit strong interpretability. Code is available in the supplementary material.
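The abstract describes two mechanisms only at a high level, so the sketch below is purely illustrative and not the paper's actual method: it shows one plausible way to (a) derive relational cross-modal self-supervision signals, by keeping only sample pairs that are mutual nearest neighbors in both the image and text feature spaces, and (b) produce soft clustering assignments against a set of semantic center vectors. All names (`img_feats`, `txt_feats`, `centers`, the top-k rule, the temperature `tau`) are assumptions; in the paper the centers would be learned via prompt learning rather than sampled randomly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for VLM features (illustrative only): 6 samples, 4-dim.
img_feats = rng.normal(size=(6, 4))
txt_feats = img_feats + 0.1 * rng.normal(size=(6, 4))  # texts loosely aligned with images


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def relational_pseudo_positives(img, txt, k=2):
    """Mark a pair as a pseudo-positive self-supervision signal only when
    the two samples are mutual top-k neighbors in BOTH modalities."""
    def mutual_topk_mask(feats):
        sim = l2_normalize(feats) @ l2_normalize(feats).T
        np.fill_diagonal(sim, -np.inf)           # exclude self-similarity
        idx = np.argsort(-sim, axis=1)[:, :k]    # top-k neighbors per row
        mask = np.zeros(sim.shape, dtype=bool)
        np.put_along_axis(mask, idx, True, axis=1)
        return mask & mask.T                     # keep mutual neighbors only

    # Agreement across modalities yields more discriminative signals
    # than per-image text similarity alone.
    return mutual_topk_mask(img) & mutual_topk_mask(txt)


def soft_assign(feats, centers, tau=0.1):
    """Soft clustering assignment against continuous semantic centers."""
    logits = l2_normalize(feats) @ l2_normalize(centers).T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


positives = relational_pseudo_positives(img_feats, txt_feats)

# Placeholder for prompt-learned centers: 3 hypothetical categories.
centers = rng.normal(size=(3, 4))
assignments = soft_assign(img_feats, centers)
```

The mutual-agreement mask is symmetric with an empty diagonal, so it can directly supervise a pairwise contrastive or clustering loss, and each row of `assignments` is a probability distribution over the candidate categories.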