The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

arXiv cs.CL / 4/17/2026


Key Points

The paper studies why multimodal language models underperform on visual perception tasks, introducing centroid replacement (collapsing each token to its nearest K-means centroid) as a controlled probe of modal dependence.
Across seven models spanning three architecture families, erasing text centroid structure costs about 4× more accuracy than erasing visual centroid structure, revealing a consistent imbalance in which language dominates vision even on visually demanding tasks.
The authors exploit this asymmetry with “text centroid contrastive decoding,” improving accuracy by up to +16.9% on specific tasks using a reference decoded under text-centroid erasure.
The effectiveness of the intervention depends on training strategy: standard fine-tuned models benefit more (+5.6% average) than preference-optimized models (+1.5% average), suggesting structural differences in how modal competition is learned.
The results indicate modal competition is structurally localized and correctable at inference time without retraining, while also providing a diagnostic signal for designing future multimodal training.
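
The centroid replacement probe described in the key points can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say which representations are clustered, how many centroids are used, or what data the clustering is fit on, so the toy embeddings, the `k=2` setting, and the helper names `fit_kmeans` and `centroid_replace` are all assumptions made for illustration.

```python
import numpy as np

def fit_kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every token to its nearest centroid (squared Euclidean distance).
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def centroid_replace(tokens, centroids):
    """Collapse each token embedding to its nearest centroid."""
    dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids[dists.argmin(axis=1)]

rng = np.random.default_rng(0)
# Toy stand-in for one modality's token embeddings (assumption):
# two well-separated blobs in 4 dimensions.
tokens = np.concatenate([
    rng.normal(0.0, 0.1, (50, 4)),
    rng.normal(5.0, 0.1, (50, 4)),
])
centroids = fit_kmeans(tokens, k=2)   # k=2 is a hypothetical setting
erased = centroid_replace(tokens, centroids)
```

After replacement, `erased` has the same shape as `tokens` but contains at most `k` distinct vectors. In this sketch, "erasing" a modality's centroid structure would mean applying `centroid_replace` only to that modality's tokens before they are passed on to the model.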

Abstract

Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
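
The abstract does not give the scoring rule behind text centroid contrastive decoding. The sketch below uses one common form from the contrastive decoding literature, in which the log-probabilities from the intact model are weighted by `1 + alpha` and those from the reference are subtracted with weight `alpha`. That formula, the `alpha` parameter, the function name `contrastive_next_token`, and the toy logits are all assumptions and may differ from what the authors actually do.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D logit vector."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def contrastive_next_token(full_logits, erased_logits, alpha=1.0):
    """Choose the next token by contrasting two forward passes.

    full_logits:   next-token logits from the unmodified model.
    erased_logits: next-token logits from the same model run with
                   text centroid structure erased (the reference).
    alpha:         hypothetical contrast strength, not a value from the paper.
    """
    scores = (1.0 + alpha) * log_softmax(full_logits) - alpha * log_softmax(erased_logits)
    return int(scores.argmax()), scores

# Toy logits over a 3-token vocabulary (made up for illustration).
full_logits = np.array([2.0, 1.9, 0.1])
erased_logits = np.array([2.0, 0.5, 0.1])

greedy = int(full_logits.argmax())  # ordinary greedy decoding picks token 0
token, scores = contrastive_next_token(full_logits, erased_logits)  # contrast picks token 1
```

In this toy case, token 1 is nearly tied with token 0 under the intact model but loses most of its probability in the reference pass, so the contrast promotes it over the greedy choice. Because the rule only needs a second forward pass per step, it fits the abstract's description of an inference-time correction that requires no retraining.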
