From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers
arXiv cs.CL / 3/12/2026
Key Points
- ARMADA is a cross-modal knowledge distillation framework that transfers knowledge from large vision-language models to language-only models without modifying the teacher or requiring expensive multimodal pre-training.
- It supports distilling from black-box vision-language models, enabling use of proprietary or inaccessible teachers without internal access.
- The authors evaluate ARMADA on twelve natural language understanding tasks, eight complex generative reasoning tasks, and five instruction-tuning tasks, showing consistent gains across models such as DeBERTa-v2-1.4B, OPT-1.3B, and LLaMA-3B/7B/8B.
- It achieves up to 3.4% improvement on language understanding tasks and a 2.6% boost in generative reasoning, highlighting the efficiency and scalability of the approach.
- The work challenges traditional KD paradigms by demonstrating that vision-language models, even without explicit textual understanding, can meaningfully enhance language-only models when distilled appropriately.
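Because a black-box teacher exposes only text outputs (no weights or logits), distillation of this kind typically reduces to collecting teacher responses and fine-tuning the student on them. The sketch below illustrates that data-collection step in minimal form; the function names and the canned teacher response are hypothetical, and ARMADA's actual procedure may differ in its details.

```python
# Minimal sketch of black-box distillation data collection (hypothetical;
# ARMADA's actual pipeline may differ). The teacher is queried only through
# its text interface, so no internal access is required.

def query_black_box_teacher(prompt: str) -> str:
    """Stand-in for an API call to a proprietary vision-language model."""
    # Hypothetical canned response; a real teacher would be a remote API call.
    return f"Teacher answer to: {prompt}"

def build_distillation_set(prompts):
    """Collect (prompt, teacher_output) pairs for student fine-tuning."""
    return [(p, query_black_box_teacher(p)) for p in prompts]

pairs = build_distillation_set(["What is 2+2?", "Name a primary color."])
# Each pair becomes a supervised training example for the language-only
# student; no teacher weights or logits are ever needed.
```

Since the teacher's output distribution is unavailable, the student objective is ordinary cross-entropy on these teacher-generated targets rather than a logit-matching KD loss.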