Abstract
Vision-Language Models (VLMs) such as CLIP learn a shared embedding space for images and text, yet their representations remain geometrically separated, a phenomenon known as the modality gap. This gap limits tasks that require cross-modal interchangeability, such as captioning and joint clustering. Existing post-processing approaches can partially improve cross-modal compatibility; however, our geometric analysis shows that they primarily reduce the global centroid offset while leaving the underlying distributional mismatch intact. We decompose the modality gap into a Centroid Gap and a Distribution Gap, and demonstrate that the Distribution Gap is the true predictor of cross-modal task quality ($R^2{=}0.986$), whereas the commonly used Raw Gap is misleading ($R^2{=}0.691$). Motivated by this observation, we propose TPC-CMA (Three-Phase Curriculum for Cross-Modal Alignment), a fine-tuning framework that explicitly reduces both components. The CMA objective jointly mitigates the centroid offset and reshapes the distributional structure, while a three-phase curriculum with gradient-aware scheduling introduces alignment progressively during training, enabling stable optimization. Experiments demonstrate that our method significantly improves cross-modal alignment. With $\alpha_{\text{target}}{=}0.05$, the modality gap is reduced by 66.6\% with only a 4.84\% accuracy drop. Under stronger alignment ($\alpha_{\text{target}}{=}0.5$), the gap is reduced by 82.3\%, clustering ARI improves from 0.318 to 0.516, and captioning CIDEr increases by 57.1\% over the original model. Our code and pre-trained models will be made publicly available upon acceptance.
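For intuition, one plausible formalization of the decomposition, for $N$ paired image embeddings $x_i$ and text embeddings $y_i$, is sketched below; the specific formulas are an illustrative assumption, and the precise definitions are given in the main text:
\begin{align*}
\text{Raw Gap} &= \frac{1}{N}\sum_{i=1}^{N}\lVert x_i - y_i \rVert_2, \\
\text{Centroid Gap} &= \lVert \bar{x} - \bar{y} \rVert_2, \qquad \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i,\quad \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i, \\
\text{Distribution Gap} &= \frac{1}{N}\sum_{i=1}^{N}\bigl\lVert (x_i - \bar{x}) - (y_i - \bar{y}) \bigr\rVert_2.
\end{align*}
Under this illustrative reading, a global mean shift (as applied by centroid-based post-processing) removes the Centroid Gap entirely while leaving the Distribution Gap, and hence the distributional mismatch, unchanged.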