The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

arXiv cs.CL / 4/17/2026


Key Points

  • The paper asks why multimodal language models underperform on visual perception even though they jointly model text and vision, introducing centroid replacement as a controlled probe of modal dependence.
  • Replacing (erasing) the text representation’s centroid structure hurts accuracy about 4× more than erasing the visual centroid structure, revealing a consistent imbalance where language dominates vision even on visually demanding tasks.
  • The authors exploit this asymmetry with “text centroid contrastive decoding,” improving accuracy by up to +16.9% on specific tasks using a reference decoded under text-centroid erasure.
  • The effectiveness of the intervention depends on training strategy: standard fine-tuned models benefit more (+5.6% average) than preference-optimized models (+1.5% average), suggesting structural differences in how modal competition is learned.
  • The results indicate modal competition is structurally localized and correctable at inference time without retraining, while also providing a diagnostic signal for designing future multimodal training.
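The centroid-replacement probe described above can be sketched in a few lines: cluster a set of token representations with K-means, then collapse every token to its nearest centroid, erasing fine-grained within-cluster structure while preserving the coarse geometry. This is a minimal illustration, not the paper's implementation; the function names and the plain Lloyd's-algorithm clustering are assumptions for the sketch.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's K-means; a hypothetical stand-in for the paper's clustering."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def centroid_replace(H, centroids):
    """Collapse each token representation in H to its nearest centroid,
    erasing within-cluster detail (the 'centroid erasure' intervention)."""
    dists = np.linalg.norm(H[:, None] - centroids[None], axis=-1)
    return centroids[dists.argmin(axis=1)]
```

Applying `centroid_replace` to the text-side (or vision-side) token states and measuring the resulting accuracy drop is what yields the modal-dependence comparison in the key points above.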

Abstract

Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
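The contrastive-decoding step can be illustrated with a small sketch: score next tokens with the full model, score them again under text-centroid erasure, and push the final logits away from the erased reference. The combination rule and the `alpha` contrast strength here are assumptions for illustration, not values or formulas taken from the paper.

```python
import numpy as np

def text_centroid_contrastive_decode(logits_full, logits_erased, alpha=1.0):
    """Contrast the full model's next-token logits against a reference
    decoded under text-centroid erasure. A common contrastive-decoding
    form is assumed: amplify the full logits and subtract the reference;
    alpha controls the contrast strength (hypothetical default)."""
    return (1.0 + alpha) * logits_full - alpha * logits_erased
```

Intuitively, tokens the model would pick even with its text structure erased reflect a language prior rather than visual evidence; subtracting the erased reference down-weights them, which is how the intervention recovers accuracy on visually demanding tasks.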