Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models

arXiv cs.CV / 3/24/2026


Key Points

  • The paper reports a diagnostic finding that multimodal LLMs suffer visual representation degradation in their middle layers, with a loss of both global function and patch structure relative to the initial visual features.
  • It attributes the degradation to “visual sacrifice” caused by optimizing for a singular text-generation objective, leading the model to compromise visual fidelity to improve answer generation.
  • The authors propose Predictive Regularization (PRe), which trains the degraded intermediate visual features to predict the initial visual features, thereby preserving core visual attributes.
  • Experiments indicate that applying PRe mitigates visual degradation and yields measurable improvements in vision-language performance, supporting the need for both cross-modal reasoning and preserved visual competence.

Abstract

While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training to internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that, compared to the initial visual features, the visual representations in the middle layers of the LLM exhibit degradation in both global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe), which forces degraded intermediate features to predict the initial visual features, thereby maintaining the inherent visual attributes of the MLLM's internal representations. Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.
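To make the core idea concrete, the auxiliary objective described above can be sketched as a prediction head that maps degraded intermediate visual tokens back to the initial (pre-LLM) visual features, with a distance between prediction and target added to the training loss. The paper's abstract does not specify the head architecture or the distance measure, so the linear head and negative-cosine loss below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def pre_loss(intermediate, initial, W):
    """Hypothetical PRe-style auxiliary loss (a sketch, not the paper's exact form).

    A linear head W predicts the initial visual features from degraded
    intermediate features; the loss is the mean negative cosine
    similarity between prediction and target (0 = perfect alignment).
    """
    pred = intermediate @ W                                   # (patches, d_vis)
    pred_n = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    tgt_n = initial / np.linalg.norm(initial, axis=-1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(pred_n * tgt_n, axis=-1)))

# Toy shapes: 16 patch tokens, LLM hidden width 32, visual-encoder width 8.
rng = np.random.default_rng(0)
intermediate = rng.normal(size=(16, 32))   # middle-layer visual tokens
initial = rng.normal(size=(16, 8))         # initial visual features (target)
W = rng.normal(size=(32, 8))               # trainable prediction head
loss = pre_loss(intermediate, initial, W)  # scalar in [0, 2]
```

In training, this term would be weighted and added to the text-generation loss, so gradients flowing through the intermediate features discourage them from discarding the visual information needed to reconstruct the initial representation.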