One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness
arXiv cs.CL / 5/1/2026
Key Points
- The paper explains that hubness—where some embeddings become “hubs” close to many unrelated examples—can undermine embedding-based tasks in high-dimensional spaces.
- It focuses on cross-modal encoders like CLIP-style models that map images and text into a shared embedding space, arguing that the presence of hub embeddings can be exploited.
- The authors propose a method to identify a specific hub embedding and the corresponding hub text that triggers abnormal cross-modal similarity.
- Experiments on MSCOCO and nocaps for caption evaluation, and on MSCOCO and Flickr30k for image-to-text retrieval, show that the method can find a single hub text producing similarity scores comparable to or higher than human-written captions across many images.
- The results indicate a practical vulnerability in cross-modal encoder evaluation and retrieval pipelines, where automated metrics may be gamed by hub text.
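The hubness phenomenon in the first bullet can be made concrete with a standard k-occurrence measurement: count how often each text lands in the top-k nearest neighbors of the image embeddings; a text whose count is far above the mean of k behaves as a hub. The sketch below is illustrative, not the paper's method — the synthetic data and the hand-planted hub vector are assumptions. It mimics one known driver of cross-modal hubness, namely that image embeddings concentrate in a narrow cone, so a text aligned with the cone's axis is abnormally similar to many unrelated images.

```python
import numpy as np

def k_occurrence(text_embs, image_embs, k=5):
    """For each image, take the top-k most cosine-similar texts,
    then count how often each text appears across those lists.
    A count far above the mean (k * num_images / num_texts)
    indicates hub-like behavior."""
    # L2-normalize so dot products equal cosine similarities,
    # matching CLIP-style shared-space retrieval.
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = v @ t.T                              # (num_images, num_texts)
    topk = np.argsort(-sims, axis=1)[:, :k]     # top-k text indices per image
    return np.bincount(topk.ravel(), minlength=len(t))

# Toy demo (synthetic, hypothetical): shift the image cloud so it
# sits in a cone, then plant one text along the cone's axis.
rng = np.random.default_rng(0)
images = rng.normal(size=(100, 64)) + 2.0   # concentrated image cloud
texts = rng.normal(size=(50, 64))           # ordinary candidate texts
texts[0] = np.ones(64)                      # hub candidate on the cone axis
counts = k_occurrence(texts, images, k=5)
# Mean k-occurrence is always k * 100 / 50 = 10; the planted hub's
# count is far higher, dominating the top-k lists of most images.
print(int(counts[0]), float(counts.mean()))
```

On this toy geometry the planted text outranks every ordinary text for nearly all images, which is the same failure mode the bullets describe: a single hub text scoring as high as, or higher than, genuine captions across many images.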