IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
arXiv cs.CV / 3/23/2026
Key Points
- IsoCLIP investigates intra-modal misalignment in CLIP by focusing on the role of image and text projectors in mapping to the shared embedding space.
- The authors distinguish an inter-modal operator, which aligns the two modalities during training, from an intra-modal operator, which only enforces per-modality normalization.
- Spectral analysis reveals an approximately isotropic subspace where the two modalities align well, along with anisotropic directions specific to each modality.
- They show that the aligned subspace can be derived directly from the projector weights, and that removing the anisotropic directions improves intra-modal alignment (see the sketch after this list).
- The method is training-free: it reduces intra-modal misalignment, lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models; the code is publicly released on GitHub.
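To make the last two points concrete, here is a minimal sketch of what deriving an aligned subspace from projector weights and stripping anisotropic directions could look like. It assumes the isotropic subspace is found by thresholding the singular values of a projector's weight matrix; the tolerance `tol`, the median-based selection rule, and the HuggingFace-style `model.visual_projection` attribute are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def isotropic_projection(W: torch.Tensor, tol: float = 0.05) -> torch.Tensor:
    """Build an orthogonal projector onto the approximately isotropic
    subspace of a CLIP projection matrix W (shape: [embed_dim, hidden_dim]).

    Hypothetical criterion: keep left singular directions whose singular
    values lie within `tol` of the median, i.e. directions along which W
    acts nearly uniformly. The selection rule here is an assumption for
    illustration, not IsoCLIP's actual derivation.
    """
    # SVD of the projector weights: W = U diag(S) V^T
    U, S, _ = torch.linalg.svd(W, full_matrices=False)
    # Treat directions with near-median singular value as isotropic.
    median = S.median()
    keep = (S - median).abs() / median < tol
    U_iso = U[:, keep]        # orthonormal basis of the isotropic subspace
    # Orthogonal projector onto that subspace in embedding space.
    return U_iso @ U_iso.T    # shape: [embed_dim, embed_dim]

# Usage (assuming a HuggingFace-style CLIPModel): drop the anisotropic,
# modality-specific directions from the embeddings and re-normalize,
# with no retraining.
# P = isotropic_projection(model.visual_projection.weight)
# z = torch.nn.functional.normalize(image_embeds @ P, dim=-1)
```

Because the projector is built once from the weights alone, the per-query cost is a single matrix multiply, which is consistent with the latency reduction the key points describe.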