In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads

arXiv cs.CL / 4/9/2026


Key Points

  • The paper studies in-context learning (ICL) in speech language models by using a Text-to-Speech (TTS) setup with demonstrations to test both content accuracy and acoustic imitation.
  • It finds that speaking rate is a major driver of ICL performance and is also reflected in the generated speech, while pitch range and intensity contribute little and are inconsistently reproduced.
  • The research analyzes how linguistic and acoustic factors influence the model’s ability to infer the task from examples and to mimic properties of the demonstration audio.
  • It further shows that induction heads have a causal role in speech-based ICL: ablating the top-k induction heads eliminates the model’s ICL capability, aligning with prior results from text-based models.
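The acoustic properties above (pitch and intensity; speaking rate is typically derived from word- or phone-level alignments) can be measured with standard signal-processing primitives. The following is a minimal NumPy sketch of two of them, a crude autocorrelation pitch estimate and an RMS intensity level, verified on a synthetic 200 Hz tone. It is an illustration of the kind of features involved, not the paper's actual measurement pipeline.

```python
import numpy as np

def rms_db(x):
    """Intensity proxy: RMS level in dB relative to full scale."""
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def estimate_f0(x, sr, fmin=80.0, fmax=400.0):
    """Crude pitch estimate: autocorrelation peak within [fmin, fmax] Hz."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    lo, hi = int(sr / fmax), int(sr / fmin)            # lag search range
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return sr / lag

sr = 16000
t = np.arange(sr) / sr                        # 1 second of samples
tone = 0.5 * np.sin(2 * np.pi * 200.0 * t)    # 200 Hz test tone

f0 = estimate_f0(tone, sr)    # → 200.0 Hz
level = rms_db(tone)          # → about -9.0 dBFS (amplitude 0.5 sine)
print(f0, level)
```

Pitch *range* would then be the spread of such frame-wise f0 estimates over an utterance; production systems use more robust trackers (e.g. YIN or pYIN) rather than a single raw autocorrelation.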

Abstract

In-Context Learning (ICL) has been extensively studied in text-only Language Models, but remains largely unexplored in the speech domain. Here, we investigate how linguistic and acoustic features affect ICL in Speech Language Models. We focus on the Text-to-Speech (TTS) task, which allows us to analyze ICL from two angles: (1) how accurately the model infers the task from the demonstrations (i.e., generating the correct spoken content), and (2) to what extent the model mimics the acoustic characteristics of the demonstration speech in its output. We find that speaking rate strongly affects ICL performance and is also mimicked in the output, whereas pitch range and intensity have little impact on performance and are not consistently reproduced. Finally, we investigate the role of induction heads in speech-based ICL and show that these heads play a causal role: ablating the top-k induction heads completely removes the model's ICL ability, mirroring findings from text-based ICL.
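An induction head is an attention head that, at position i, attends to positions immediately following earlier occurrences of the current token. Before ablating the "top-k" heads, one first needs a per-head induction score; a common choice (assumed here, the paper may use a different variant) is the attention mass a head places on such prefix-matching positions. The toy sketch below computes that score for a synthetic perfect-induction attention pattern and for a uniform baseline; it uses no real model.

```python
import numpy as np

def induction_score(attn, tokens):
    """Average attention mass a head places on positions j whose
    *previous* token equals the current query token, i.e. the
    induction-head signature. attn is a (T, T) row-stochastic matrix."""
    T = len(tokens)
    total, queries = 0.0, 0
    for i in range(1, T):
        matches = [j for j in range(1, i + 1) if tokens[j - 1] == tokens[i]]
        if matches:
            total += attn[i, matches].sum()
            queries += 1
    return total / max(queries, 1)

rng = np.random.default_rng(0)
seq = list(rng.integers(0, 50, size=20))
tokens = seq + seq  # repeated random sequence: the classic induction probe
T = len(tokens)

# Synthetic "perfect induction" head: each query attends entirely to the
# earliest position following a previous occurrence of its own token.
attn = np.zeros((T, T))
for i in range(T):
    matches = [j for j in range(1, i + 1) if tokens[j - 1] == tokens[i]]
    attn[i, matches[0] if matches else i] = 1.0

uniform = np.ones((T, T)) / T  # baseline head with no induction behavior

print(induction_score(attn, tokens))     # → 1.0 (perfect induction head)
print(induction_score(uniform, tokens))  # low: attention is diffuse
```

Ranking real heads by such a score and zeroing out (ablating) the top-k is the causal test the paper describes: if ICL accuracy collapses after ablation, those heads were necessary for it.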