Is the Modality Gap a Bug or a Feature? A Robustness Perspective

arXiv cs.CV / 4/1/2026


Key Points

  • The paper analyzes why multi-modal contrastive models (like CLIP-style VLMs) exhibit a “modality gap,” where image and text embeddings become strongly separated in the shared space.
  • It shows that, under specific conditions, minimizing contrastive loss produces a global gap vector that is orthogonal to the modality embeddings.
  • The authors connect this modality gap to robustness, finding that reducing the gap does not affect clean accuracy but increases output stability under embedding perturbations.
  • Experiments indicate that a simple post-processing adjustment—moving one modality’s embeddings toward the mean of the other—can significantly improve robustness across many real-world VLMs without sacrificing clean performance.

Abstract

Many modern multi-modal models (e.g., CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists or whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models, but makes it less likely that a model will change its output when the embeddings are perturbed. Our experiments show that for many real-world VLMs we can significantly increase robustness with a simple post-processing step that moves one modality towards the mean of the other modality, without any loss of clean accuracy.
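The post-processing step described in the abstract, shifting one modality's embeddings toward the centroid of the other, can be sketched in a few lines of NumPy. This is a hypothetical reconstruction for illustration only: the function name `close_modality_gap`, the `alpha` step-size parameter, and the final renormalization are assumptions, not the paper's exact procedure.

```python
import numpy as np

def close_modality_gap(image_emb, text_emb, alpha=1.0):
    """Shift text embeddings toward the image centroid to shrink the modality gap.

    Hypothetical sketch of the paper's post-processing idea; the authors'
    exact procedure may differ. `alpha` (assumed parameter) controls how far
    along the gap vector the text embeddings are moved.
    """
    # Normalize both modalities to the unit sphere, as in CLIP-style models.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Global gap vector: difference between the two modality centroids.
    gap = image_emb.mean(axis=0) - text_emb.mean(axis=0)

    # Move text embeddings a fraction alpha of the way along the gap,
    # then renormalize so they stay on the unit sphere.
    shifted = text_emb + alpha * gap
    return shifted / np.linalg.norm(shifted, axis=1, keepdims=True)
```

With `alpha=1.0` the text centroid lands (up to renormalization) on the image centroid, so the centroid distance between the two modalities shrinks while each text embedding keeps its relative position to its neighbors.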