RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue

arXiv cs.AI / 3/25/2026

Key Points

  • The paper introduces RelayS2S, a hybrid real-time speech-to-speech dialogue architecture designed to balance low latency and high semantic quality.
  • It runs two parallel paths after turn detection: a fast duplex S2S model speculatively streams a short response prefix, and a slower ASR→LLM pipeline generates a higher-quality continuation conditioned on that prefix.
  • A lightweight learned verifier decides whether to commit the speculative prefix or fall back to the slow path, aiming for seamless utterances without disrupting either component’s internal design.
  • Experiments report that RelayS2S achieves P90 audio onset latency comparable to the S2S model while retaining about 99% of the cascaded pipeline's average response-quality score, with benefits growing as the slow-path model scales.
  • The authors claim RelayS2S is a “drop-in” addition to existing cascaded pipelines and provide public code/data via the linked GitHub repository.
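
The control flow described in the points above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `fast_path`, `transcribe`, `slow_llm`, `toy_verifier`, and `respond` are hypothetical stand-ins, the sleeps only imitate model latency, and the real verifier is a learned model rather than the toy rule used here.

```python
import asyncio
from typing import Callable, Tuple


async def fast_path(user_audio: str) -> str:
    """Hypothetical stand-in for the duplex S2S model drafting a short prefix."""
    await asyncio.sleep(0.01)  # pretend the fast model answers quickly
    return "Sure, let me check."


async def transcribe(user_audio: str) -> str:
    """Hypothetical stand-in for ASR; the stub treats the input as text."""
    await asyncio.sleep(0.02)
    return user_audio


async def slow_llm(transcript: str, prefix: str) -> str:
    """Hypothetical stand-in for the LLM.

    With a committed prefix it returns only a continuation of that prefix;
    with an empty prefix it returns a full response on its own.
    """
    await asyncio.sleep(0.03)
    reply = "Your order ships tomorrow."
    return " " + reply if prefix else reply


def toy_verifier(prefix: str) -> bool:
    """Toy gate (assumption): accept any non-empty draft."""
    return bool(prefix.strip())


async def respond(
    user_audio: str,
    verify: Callable[[str], bool] = toy_verifier,
) -> Tuple[bool, str]:
    # Both paths start at turn detection and run concurrently.
    draft_task = asyncio.create_task(fast_path(user_audio))
    asr_task = asyncio.create_task(transcribe(user_audio))

    draft = await draft_task
    committed = verify(draft)
    # A committed prefix would be streamed to TTS at this point;
    # on rejection the draft is dropped and the slow path answers alone.
    prefix = draft if committed else ""

    transcript = await asr_task
    continuation = await slow_llm(transcript, prefix)
    return committed, prefix + continuation


if __name__ == "__main__":
    print(asyncio.run(respond("where is my order")))
```

Because the gate is passed in as a plain callable, swapping the toy rule for a trained classifier would not require touching either path, which mirrors the "drop-in" framing in the summary.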

Abstract

Real-time spoken dialogue systems face a fundamental tension between latency and response quality. End-to-end speech-to-speech (S2S) models respond immediately and naturally handle turn-taking, backchanneling, and interruption, but produce semantically weaker outputs. Cascaded pipelines (ASR -> LLM) deliver stronger responses at the cost of latency that grows with model size. We present RelayS2S, a hybrid architecture that runs two paths in parallel upon turn detection. The fast path -- a duplex S2S model -- speculatively drafts a short response prefix that is streamed immediately to TTS for low-latency audio onset, while continuing to monitor live audio events. The slow path -- a cascaded ASR -> LLM pipeline -- generates a higher-quality continuation conditioned on the committed prefix, producing a seamless utterance. A lightweight learned verifier gates the handoff, committing the prefix when appropriate or falling back gracefully to the slow path alone. Experiments show that RelayS2S achieves P90 onset latency comparable to that of the S2S model while retaining 99% of the cascaded pipeline's response quality in average score, with benefits growing as the slow-path model scales. Because the prefix handoff requires no architectural modification to either component, RelayS2S serves as a lightweight, drop-in addition to existing cascaded pipelines. Our code and data are publicly available at: https://github.com/mailong25/relays2s
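
For readers unfamiliar with the P90 onset-latency metric mentioned in the abstract, here is a small sketch of how a 90th-percentile latency can be computed. The nearest-rank definition is an assumption (the paper may use an interpolating convention), the helper name `percentile_nearest_rank` is hypothetical, and the latency values are made up for illustration only; they are not results from the paper.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], p: int) -> float:
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least p% of samples at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up onset latencies in milliseconds, one per dialogue turn.
onset_ms = [180, 210, 190, 650, 200, 220, 205, 195, 230, 215]

p90 = percentile_nearest_rank(onset_ms, 90)
mean = sum(onset_ms) / len(onset_ms)

print(p90)   # 230
print(mean)  # 249.5
```

In this toy sample the single 650 ms outlier pulls the mean up to 249.5 ms, while the P90 of 230 ms describes what nine out of ten turns stay at or below, which is why tail percentiles are a common way to report interactive latency.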