One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

arXiv cs.CL / 5/1/2026


Key Points

  • The paper explains that hubness, where some embeddings become “hubs” that sit close to many unrelated examples, can undermine embedding-based tasks in high-dimensional spaces (a minimal way to measure this is sketched after this list).
  • It focuses on cross-modal encoders like CLIP-style models that map images and text into a shared embedding space, arguing that the presence of hub embeddings can be exploited.
  • The authors propose a method to identify a specific hub embedding and the corresponding hub text that triggers abnormal cross-modal similarity.
  • Experiments on MSCOCO and nocaps for caption evaluation, and on MSCOCO and Flickr30k for image-to-text retrieval, show that the method can find a single hub text whose similarity scores are comparable to or higher than those of human-written captions across many images.
  • The results indicate a practical vulnerability in cross-modal encoder evaluation and retrieval pipelines, where automated metrics may be gamed by hub text.
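
As a rough illustration of the hubness statistic the key points refer to (this is a sketch, not the paper's code), the k-occurrence count N_k tallies how often an embedding appears among the k nearest neighbors of a set of queries; an embedding whose N_k is far above k behaves as a hub. The shapes and random vectors below are placeholders standing in for image and text embeddings.

```python
import numpy as np

def k_occurrence(query_embs: np.ndarray, gallery_embs: np.ndarray, k: int = 10) -> np.ndarray:
    """Count how often each gallery embedding appears in the top-k cosine neighbors of the queries."""
    # L2-normalize so the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = q @ g.T                            # (num_queries, num_gallery) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar gallery items per query
    counts = np.zeros(g.shape[0], dtype=int)
    np.add.at(counts, topk.ravel(), 1)        # N_k for every gallery embedding
    return counts

# Toy usage: random vectors stand in for CLIP-style image (query) and text (gallery) embeddings.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(1000, 512))     # hypothetical image embeddings
text_embs = rng.normal(size=(200, 512))       # hypothetical text embeddings
n_k = k_occurrence(image_embs, text_embs, k=10)
print("most hub-like text index:", n_k.argmax(), "with N_k =", n_k.max())
```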

Abstract

The hubness problem, in which hub embeddings are close to many unrelated examples, often occurs in high-dimensional embedding spaces and may pose a practical threat to applications such as information retrieval and automatic evaluation metrics. In particular, since cross-modal similarity between text and images cannot be computed by direct comparison, such as string matching, cross-modal encoders that project different modalities into a shared space are used in a variety of cross-modal applications, and the existence of hubs therefore becomes a practical threat. To reveal the vulnerabilities of cross-modal encoders, we propose a method for identifying the hub embedding and its corresponding hub text. Experiments on image captioning evaluation with MSCOCO and nocaps, along with image-to-text retrieval tasks on MSCOCO and Flickr30k, show that our method can identify a single hub text that unreasonably achieves similarity scores comparable to or higher than those of human-written reference captions for many images, thereby revealing the vulnerabilities of cross-modal encoders.
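
To make the threat concrete, here is a minimal sketch, assuming an off-the-shelf Hugging Face CLIP checkpoint ("openai/clip-vit-base-patch32") stands in for the cross-modal encoder; the paper's own hub-text search, models, and datasets are not reproduced here. It only illustrates the evaluation being probed: whether a single candidate text matches or beats each image's human-written reference caption on image-text cosine similarity. The helper names and placeholder image paths are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed stand-in for the cross-modal encoder
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def text_embedding(text: str) -> torch.Tensor:
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def image_embedding(image: Image.Image) -> torch.Tensor:
    inputs = processor(images=[image], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def candidate_beats_references(candidate: str, pairs: list[tuple[Image.Image, str]]) -> float:
    """Fraction of (image, reference caption) pairs where the candidate text scores
    at least as high as the reference under image-text cosine similarity."""
    cand_emb = text_embedding(candidate)
    wins = 0
    for image, reference in pairs:
        img_emb = image_embedding(image)
        cand_sim = (img_emb @ cand_emb.T).item()                   # candidate text vs image
        ref_sim = (img_emb @ text_embedding(reference).T).item()   # reference caption vs image
        wins += cand_sim >= ref_sim
    return wins / len(pairs)

# Usage (paths, captions, and the candidate string are placeholders):
# pairs = [(Image.open("img_001.jpg"), "a dog runs on the beach"), ...]
# print(candidate_beats_references("some candidate hub text", pairs))
```

A hub text, in the paper's setting, is one that pushes this fraction unusually high across many images despite describing none of them.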