From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

arXiv cs.CL / 3/12/2026

Key Points

ARMADA is a cross-modal knowledge distillation framework that transfers knowledge from large vision-language models to language-only models without modifying the teacher or requiring expensive multimodal pre-training.
It can distil from black-box vision-language models, so proprietary teachers whose internals are inaccessible can still be used.
The authors evaluate ARMADA on twelve natural language understanding tasks, eight complex generative reasoning tasks, and five instruction-tuning tasks, showing consistent gains for large models such as DeBERTa-v2-1.4B, OPT-1.3B, and LLaMA-3B/7B/8B.
It achieves up to a 3.4% improvement on language understanding tasks and a 2.6% boost in generative reasoning, with no multimodal pre-training and no fine-tuning of the teacher.
The work challenges traditional KD paradigms by demonstrating that vision-language models, despite lacking direct textual understanding, can meaningfully enhance language models when distilled appropriately.

Abstract

Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without significantly dropping performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, and LLaMA-{3B, 7B, 8B}. ARMADA achieves up to a 3.4% improvement on language understanding tasks and a 2.6% boost in generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model. Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.
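
For readers unfamiliar with the "traditional KD" setup that the abstract contrasts ARMADA against, the sketch below shows the classical soft-label distillation loss for a teacher and student that share a modality and a label space. It is not ARMADA's method: the abstract does not describe the alignment techniques, and a black-box teacher may not expose logits at all. The function names, the temperature of 2.0, and the mixing weight `alpha` are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Classical same-modality KD loss (illustrative, not ARMADA).

    Combines cross-entropy on the hard label with a temperature-softened
    KL term that pulls the student distribution toward the teacher's.
    """
    # Hard-label cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL(teacher || student) at the shared temperature,
    # scaled by T^2 so its gradient magnitude stays comparable to the CE term.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    return alpha * ce + (1 - alpha) * soft


if __name__ == "__main__":
    # Hypothetical logits for a 3-class problem.
    print(kd_loss([1.0, 2.0, 3.0], [0.5, 2.5, 3.5], label=2))
```

The soft term requires teacher logits over the same classes as the student, which is exactly the homogeneity assumption the abstract says ARMADA relaxes.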

