Is the Modality Gap a Bug or a Feature? A Robustness Perspective

arXiv cs.CV / 4/1/2026


Key Points

  • The paper analyzes why multi-modal contrastive models (like CLIP-style VLMs) exhibit a “modality gap,” where image and text embeddings become strongly separated in the shared space.
  • It shows that, under specific conditions, minimizing contrastive loss produces a global gap vector that is orthogonal to the modality embeddings.
  • The authors connect this modality gap to robustness, finding that reducing the gap does not affect clean accuracy but increases output stability under embedding perturbations.
  • Experiments indicate that a simple post-processing adjustment—moving one modality’s embeddings toward the mean of the other—can significantly improve robustness across many real-world VLMs without sacrificing clean performance.

Abstract

Many modern multi-modal models (e.g., CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists or whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models, but makes it less likely that a model will change its output when the embeddings are perturbed. Our experiments show that for many real-world VLMs we can significantly increase robustness with a simple post-processing step that moves one modality towards the mean of the other modality, without any loss of clean accuracy.
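The post-processing step described in the abstract, shifting one modality's embeddings toward the centroid of the other, can be sketched in a few lines of NumPy. This is a hypothetical reconstruction for illustration only: the function name `close_modality_gap`, the `alpha` step-size parameter, and the final renormalization are assumptions, not the paper's exact procedure.

```python
import numpy as np

def close_modality_gap(image_emb, text_emb, alpha=1.0):
    """Shift text embeddings toward the image centroid to shrink the modality gap.

    Hypothetical sketch of the paper's post-processing idea; the authors'
    exact procedure may differ. `alpha` (assumed parameter) controls how far
    along the gap vector the text embeddings are moved.
    """
    # Normalize both modalities to the unit sphere, as in CLIP-style models.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Global gap vector: difference between the two modality centroids.
    gap = image_emb.mean(axis=0) - text_emb.mean(axis=0)

    # Move text embeddings a fraction alpha of the way along the gap,
    # then renormalize so they stay on the unit sphere.
    shifted = text_emb + alpha * gap
    return shifted / np.linalg.norm(shifted, axis=1, keepdims=True)
```

With `alpha=1.0` the text centroid lands (up to renormalization) on the image centroid, so the centroid distance between the two modalities shrinks while each text embedding keeps its relative position to its neighbors.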