LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

arXiv cs.CV / 4/2/2026


Key Points

  • The paper identifies that adapting pretrained language models (LMs) into vision-language models (VLMs) can significantly degrade the LM’s original linguistic ability due to representation shift and cross-modal interference.
  • It proposes LinguDistill, an adapter-free distillation approach that uses the original frozen LM as a teacher to recover linguistic capability without adding extra architectural modules or inference-time parameters.
  • To let the frozen teacher meaningfully supervise the vision-conditioned student, the method introduces layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without changing either model's architecture.
  • The authors selectively distill language-focused supervision on language-intensive data to regain language/knowledge performance while preserving strong visual grounding on multimodal tasks.
  • Results show that LinguDistill recovers roughly 10% of the performance lost on language and knowledge benchmarks while maintaining comparable performance on vision-heavy tasks, demonstrating a practical path to mitigating modality-specific degradation.
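The selective distillation idea above can be illustrated with a small sketch. The paper does not publish its loss formulation in this summary, so the names (`selective_distill_loss`, `alpha`, `temp`) and the exact mixing rule are illustrative assumptions; the core idea shown is standard temperature-scaled KL distillation from the frozen teacher, applied only on language-intensive batches:

```python
import math

def softmax(logits, temp=1.0):
    # Temperature-scaled softmax over a list of logits.
    scaled = [x / temp for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) between two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def selective_distill_loss(student_logits, teacher_logits, task_loss,
                           is_language_batch, alpha=0.5, temp=2.0):
    """Hypothetical combined objective: distill from the frozen LM teacher
    only on language-intensive batches; on multimodal batches keep the
    plain task loss so visual grounding is not disturbed."""
    if not is_language_batch:
        return task_loss
    # Average token-level KL between teacher and student predictions.
    kl = sum(
        kl_divergence(softmax(t, temp), softmax(s, temp))
        for s, t in zip(student_logits, teacher_logits)
    ) / len(student_logits)
    # temp**2 rescaling keeps gradient magnitudes comparable across temperatures.
    return (1 - alpha) * task_loss + alpha * (temp ** 2) * kl
```

When teacher and student agree exactly, the KL term vanishes and only the down-weighted task loss remains; on vision-heavy batches the distillation term is skipped entirely.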

Abstract

Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers ~10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.
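The KV-cache sharing mechanism described in the abstract can be sketched in miniature. The paper's actual implementation is not shown here; this toy single-layer example only illustrates the mechanism's shape, with `teacher_layer_with_shared_kv` and the cache layout being hypothetical names. The point is that the frozen teacher's queries attend over keys/values cached from the student's multimodal forward pass, so the teacher "sees" vision-conditioned context without any architectural change:

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: q is (Tq, d), k and v are (Tk, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def teacher_layer_with_shared_kv(teacher_q, student_kv_cache, layer_idx):
    """Toy illustration of layer-wise KV-cache sharing: instead of computing
    its own keys/values, the frozen teacher layer reads the student's cached
    K/V for the same layer, exposing it to multimodal representations."""
    k, v = student_kv_cache[layer_idx]
    return attention(teacher_q, k, v)

# Example: the student's layer-0 cache (e.g. built over image + text tokens)
# is handed to the teacher's layer-0 queries.
rng = np.random.default_rng(0)
cache = {0: (rng.standard_normal((3, 4)), rng.standard_normal((3, 4)))}
out = teacher_layer_with_shared_kv(rng.standard_normal((2, 4)), cache, 0)
```

Because only the cache is rerouted, both models keep their original weights and layer definitions, which is what makes the approach adapter-free.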