Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

arXiv cs.CL / 4/6/2026


Key Points

  • The paper introduces “Speaker-Reasoner,” an end-to-end Speech LLM designed for multi-speaker ASR that jointly performs speaker attribution, transcription, and timestamp localization in complex conversations.
  • Unlike single-pass approaches, the model uses iterative, agentic multi-turn temporal reasoning to infer global audio structure, predict temporal boundaries autonomously, and then run fine-grained segment analysis.
  • It jointly models speaker identity (including gender), transcription, and timestamps, targeting key failure modes such as overlapping speech, backchannels, and rapid turn-taking.
  • To handle audio longer than its training context window, the system adds a speaker-aware cache that carries speaker context across segments, extending processing beyond that limit.
  • Experiments on AliMeeting and AISHELL-4 show consistent gains over strong baselines, with particular improvements for overlapping speech and complex conversational dynamics.
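The summary above describes a two-phase loop: a coarse global pass that predicts temporal boundaries, followed by fine-grained per-segment analysis. As a rough illustration only (the paper's actual interface is not given here, and every function name below is a hypothetical stand-in), the control flow might look like this:

```python
# Hypothetical sketch of the agentic multi-turn reasoning loop described
# in the summary; all function names are illustrative stubs, not the
# authors' actual API.

def analyze_global_structure(audio):
    # Turn 1: a coarse pass over the whole recording, estimating
    # candidate segment boundaries (stubbed with fixed values here).
    # Segments may overlap, reflecting overlapping speech.
    return [(0.0, 4.0), (3.5, 9.0)]

def analyze_segment(audio, start, end):
    # Later turns: fine-grained analysis of one predicted segment,
    # jointly producing speaker identity, gender, timestamps, and
    # transcription (stubbed here).
    return {"speaker": "spk1", "gender": "F",
            "start": start, "end": end, "text": "..."}

def speaker_reasoner(audio):
    """Multi-turn inference: infer global structure first, then run
    fine-grained analysis on each predicted segment."""
    results = []
    for start, end in analyze_global_structure(audio):
        results.append(analyze_segment(audio, start, end))
    return results

print(speaker_reasoner(audio=None))
```

The point of the sketch is the ordering: boundary prediction is itself a model turn, not a fixed preprocessing step, which is what distinguishes this from single-pass inference.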

Abstract

Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn-taking, and context window constraints. We propose Speaker-Reasoner, an end-to-end Speech LLM with agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window. Trained with a three-stage progressive strategy, Speaker-Reasoner achieves consistent improvements over strong baselines on AliMeeting and AISHELL-4 datasets, particularly in handling overlapping speech and complex turn-taking.
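The abstract does not specify how the speaker-aware cache works internally. One plausible reading, sketched below purely as an assumption (the class and its methods are invented for illustration), is a bounded per-speaker store that carries speaker context forward as long audio is processed chunk by chunk:

```python
# Illustrative guess at a "speaker-aware cache": a bounded rolling store
# keyed by speaker id, so chunks beyond the training context window can
# still be attributed consistently. Not the paper's implementation.

class SpeakerAwareCache:
    def __init__(self, max_items_per_speaker=4):
        self.max_items = max_items_per_speaker
        self.store = {}  # speaker_id -> list of context summaries

    def update(self, speaker_id, summary):
        items = self.store.setdefault(speaker_id, [])
        items.append(summary)
        # Evict oldest entries so per-speaker memory stays bounded,
        # keeping total context within the model's window.
        del items[:-self.max_items]

    def context_for(self, speaker_id):
        return self.store.get(speaker_id, [])

cache = SpeakerAwareCache(max_items_per_speaker=2)
for i in range(3):
    cache.update("spk1", f"chunk{i}")
print(cache.context_for("spk1"))  # oldest chunk evicted
```

Whatever the real mechanism, the design goal stated in the abstract is the same: retain enough speaker-level state that attribution remains consistent across audio longer than the training context window.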