Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

arXiv cs.CV / 4/17/2026


Key Points

  • The paper introduces Switch-KD, a knowledge-distillation framework aimed at making vision-language models (VLMs) more efficient for resource-constrained deployment without increasing model size or data needs.
  • It argues that existing VLM distillation methods supervise each modality separately, without explicit multimodal alignment, which leads to inconsistent multimodal knowledge transfer.
  • Switch-KD unifies vision-to-language and cross-modal knowledge transfer by using a “visual-switch” mechanism that routes the student’s visual outputs through the teacher’s language pathway to form cross-modal probabilistic references.
  • It also proposes the DBiLD loss, which adaptively aligns the most informative probability regions while preserving teacher/student distributional structure via dynamic bi-directional supervision.
  • Experiments show that a 0.5B TinyLLaVA student distilled from a 3B teacher gains an average of 3.6 points across 10 multimodal benchmarks, with no architectural changes.
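The visual-switch idea can be illustrated with a toy sketch. The paper does not publish implementation details here, so all names, dimensions, and the projection/LM-head matrices below are hypothetical stand-ins: the point is only the routing, i.e. the student's visual outputs are passed through the teacher's language pathway to produce a cross-modal probabilistic reference, which then supervises the student's own text probabilities.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical "pathways": in a real VLM these would be the teacher's and
# student's language-model heads plus a connector; here they are toy matrices.
rng = np.random.default_rng(0)
d_student, d_teacher, vocab = 8, 16, 32
W_proj = rng.normal(size=(d_student, d_teacher)) * 0.1    # student visual dim -> teacher LM dim
W_teacher_lm = rng.normal(size=(d_teacher, vocab)) * 0.1  # teacher language pathway (LM head)
W_student_lm = rng.normal(size=(d_student, vocab)) * 0.1  # student language pathway

def visual_switch_reference(student_visual_tokens):
    """Route the student's visual outputs through the teacher's language
    pathway to obtain a cross-modal probabilistic reference."""
    h = student_visual_tokens @ W_proj    # switch into the teacher's space
    return softmax(h @ W_teacher_lm)     # teacher LM head on student visuals

def kd_loss(student_visual_tokens):
    """KL(student text probs || cross-modal reference): a generic KD signal
    standing in for the paper's loss, not its exact formulation."""
    p_ref = visual_switch_reference(student_visual_tokens)
    p_stu = softmax(student_visual_tokens @ W_student_lm)
    return float(np.mean(np.sum(p_stu * (np.log(p_stu) - np.log(p_ref)), axis=-1)))

tokens = rng.normal(size=(4, d_student))  # 4 visual tokens from the student encoder
print(round(kd_loss(tokens), 4))
```

Because both the reference and the student's predictions live in the same text-probability space, the visual knowledge transfer happens implicitly, without a separate vision-feature alignment loss.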

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is challenged by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.
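The abstract describes DBiLD only at a high level ("adaptively aligns informative probability regions" with "bidirectional supervision"), so the following is one plausible reading, not the paper's definition: treat each distribution's top-k tokens as its informative region, compute a forward KL over the teacher's region and a reverse KL over the student's, and combine them with a dynamically chosen weight. The top-k choice and the weighting rule are both guessed instantiations.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def topk_mask(p, k):
    """Boolean mask over each row's k largest probabilities
    (one reading of the 'most informative probability regions')."""
    idx = np.argsort(p, axis=-1)[:, -k:]
    mask = np.zeros_like(p, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def masked_kl(p, q, mask, eps=1e-9):
    """KL(p || q) restricted to masked entries, renormalized on that support."""
    pm = np.where(mask, p, 0.0)
    qm = np.where(mask, q, 0.0)
    pm = pm / (pm.sum(-1, keepdims=True) + eps)
    qm = qm / (qm.sum(-1, keepdims=True) + eps)
    kl = np.where(mask, pm * (np.log(pm + eps) - np.log(qm + eps)), 0.0)
    return kl.sum(-1).mean()

def dbild_loss(teacher_logits, student_logits, k=5):
    pt, ps = softmax(teacher_logits), softmax(student_logits)
    # Bi-directional supervision: forward KL on the teacher's high-probability
    # region, reverse KL on the student's.
    fwd = masked_kl(pt, ps, topk_mask(pt, k))
    rev = masked_kl(ps, pt, topk_mask(ps, k))
    # "Dynamic" weighting (assumed): lean toward the larger discrepancy.
    w = fwd / (fwd + rev + 1e-9)
    return float(w * fwd + (1 - w) * rev)

rng = np.random.default_rng(1)
t, s = rng.normal(size=(4, 32)), rng.normal(size=(4, 32))
print(round(dbild_loss(t, s), 4))
```

The reverse-KL term is what lets the student's own distributional structure push back on the teacher, rather than the teacher's distribution being copied one-way.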