Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling
arXiv cs.CL / 4/3/2026
Key Points
- The paper proposes Multimodal Depth Up-Scaling, which adapts a pre-trained text LLM into a speech language model by inserting new transformer layers into the frozen text LLM and training only those added layers on speech data (see the sketch after this list).
- Experiments on SmolLM2-360M and SmolLM2-1.7B using 48k hours of English ASR data show that the method achieves ASR performance comparable to full fine-tuning while better preserving the model’s original text abilities.
- Compared with full fine-tuning and LoRA, depth up-scaling produces significantly less degradation of text capabilities while maintaining strong speech recognition quality.
- The authors further improve results by using E-Branchformer blocks as the inserted layers, achieving ASR that matches or exceeds full fine-tuning on the larger model while cutting text degradation by over 75% and using 60% fewer trainable parameters.
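
Below is a minimal PyTorch sketch of the depth up-scaling idea, assuming SmolLM2's Llama-style architecture in Hugging Face transformers. The insertion interval of 4, the use of plain Llama decoder layers (the paper's best variant uses E-Branchformer layers instead), and the skipped KV-cache bookkeeping are all illustrative assumptions, not the authors' exact recipe.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Load the text backbone (SmolLM2 follows the Llama architecture).
base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
for p in base.parameters():
    p.requires_grad = False  # pre-trained text weights are never updated

# Insert a fresh, randomly initialised layer after every 4th frozen block.
# The interval is an assumption for illustration; the paper's best variant
# uses E-Branchformer layers here rather than plain decoder layers.
upscaled = nn.ModuleList()
for i, layer in enumerate(base.model.layers):
    upscaled.append(layer)
    if (i + 1) % 4 == 0:
        upscaled.append(LlamaDecoderLayer(base.config, layer_idx=len(upscaled)))
base.model.layers = upscaled
base.config.num_hidden_layers = len(upscaled)
# KV-cache layer indices are not re-synchronised in this sketch; train
# with use_cache=False so the stale indices on frozen layers are ignored.

# Only the inserted layers carry gradients, so speech training cannot
# overwrite the original text weights.
trainable = sum(p.numel() for p in base.parameters() if p.requires_grad)
total = sum(p.numel() for p in base.parameters())
print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M params")
```

In this setup, an optimizer built from `filter(lambda p: p.requires_grad, base.parameters())` updates only the inserted layers, which is what preserves the model's text abilities during speech training.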