Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling

arXiv cs.CL / 4/3/2026


Key Points

  • The paper proposes Multimodal Depth Upscaling to adapt pre-trained text LLMs into speech language models by inserting new transformer layers into a frozen text LLM and training only those added layers on speech data.
  • Experiments on SmolLM2-360M and SmolLM2-1.7B using 48k hours of English ASR data show that the method achieves ASR performance comparable to full fine-tuning while better preserving the model’s original text abilities.
  • Compared with full fine-tuning and LoRA, depth up-scaling produces significantly less degradation of text capabilities while maintaining strong speech recognition quality.
  • The authors further improve results by using E-Branchformer as the inserted layers, achieving ASR that matches or exceeds full fine-tuning on the larger model and reducing text degradation by over 75% while using 60% fewer trainable parameters.

Abstract

Adapting pre-trained text Large Language Models (LLMs) into Speech Language Models (Speech LMs) via continual pretraining on speech data is promising, but often degrades the original text capabilities. We propose Multimodal Depth Upscaling, an extension of an emerging strategy in continual LLM pre-training, where new transformer layers are inserted into a frozen text LLM and only the added layers are trained on speech data. Experiments with SmolLM2-360M and SmolLM2-1.7B on 48k hours of English Automatic Speech Recognition (ASR) data show that depth up-scaling achieves ASR comparable to full fine-tuning while causing far less text degradation than both full fine-tuning and Low-Rank Adaptation (LoRA). We further show that incorporating E-Branchformer, an architecture designed for speech recognition, as the inserted layers achieves ASR that matches or surpasses full fine-tuning on the larger model while reducing text degradation by over 75% with 60% fewer trainable parameters.
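The core recipe described above — freeze every layer of the pre-trained text LLM, splice new layers into the stack, and train only the inserted ones — can be sketched structurally as follows. This is a toy illustration, not the authors' implementation: the `Layer` class, the fixed insertion interval, and all names are assumptions for clarity.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    """Stand-in for one transformer block (illustrative only)."""
    name: str
    trainable: bool

def depth_upscale(pretrained_depth: int, insert_every: int) -> list[Layer]:
    """Freeze the original layers and interleave a new trainable layer
    after every `insert_every` frozen layers (interval is an assumption;
    the paper may place inserted layers differently)."""
    stack: list[Layer] = []
    for i in range(pretrained_depth):
        # Original text-LLM layer: kept frozen, preserving text abilities.
        stack.append(Layer(f"frozen_{i}", trainable=False))
        if (i + 1) % insert_every == 0:
            # Newly inserted layer: the only part trained on speech data.
            stack.append(Layer(f"new_{i}", trainable=True))
    return stack

stack = depth_upscale(pretrained_depth=12, insert_every=4)
print(len(stack))                                   # 15 layers total
print([l.name for l in stack if l.trainable])       # ['new_3', 'new_7', 'new_11']
```

Because gradients flow only through the inserted layers, the frozen weights are untouched, which is why this strategy degrades text performance less than full fine-tuning; the E-Branchformer variant in the paper swaps the inserted blocks for a speech-oriented architecture while keeping the same freeze-and-insert pattern.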