TASTE-Streaming: Towards Streamable Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

arXiv cs.CL / 3/16/2026

Key Points

TASTE-S is a streamable extension of TASTE for spoken language modeling that reduces latency by integrating a CTC-based ASR module into the encoder for instant dual-modality encoding.
The unit decoder is redesigned to decode on the fly, which makes real-time streaming use possible.
With joint training, TASTE-S matches TASTE's performance while significantly reducing latency and supporting long-form encoding and decoding.
It remains robust to transcription quality, indicating resilience to imperfect ASR outputs and improved practical usability for streaming SLM.
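
To illustrate how a CTC-based module can emit text tokens while audio is still arriving, here is a minimal sketch of streaming greedy CTC decoding. It is not the TASTE-S implementation: the class name, the blank index of 0, and the assumption that each frame has already been reduced to its most likely label ID are all hypothetical choices made for illustration. The decoder applies the standard CTC collapse rule (merge repeated labels, drop blanks) and keeps its state across chunks.

```python
from typing import List, Tuple

BLANK = 0  # assumed blank index; real vocabularies may place it elsewhere


class StreamingCTCGreedyDecoder:
    """Greedy CTC collapse applied chunk by chunk (illustrative only)."""

    def __init__(self, blank: int = BLANK) -> None:
        self.blank = blank
        self.prev = blank  # label of the previous frame, kept across chunks
        self.frame = 0     # global index of the next frame to be consumed

    def feed(self, frame_ids: List[int]) -> List[Tuple[int, int]]:
        """Consume per-frame argmax label IDs for one chunk.

        Returns (token_id, frame_index) pairs for tokens that start in
        this chunk, so a caller can act on them immediately.
        """
        emitted: List[Tuple[int, int]] = []
        for label in frame_ids:
            # A token starts when a non-blank label differs from the previous frame.
            if label != self.blank and label != self.prev:
                emitted.append((label, self.frame))
            self.prev = label
            self.frame += 1
        return emitted


if __name__ == "__main__":
    decoder = StreamingCTCGreedyDecoder()
    for chunk in ([0, 3, 3, 0], [3, 5, 5, 0, 5]):
        print(decoder.feed(chunk))
```

Because the previous label is stored on the decoder, a token that spans a chunk boundary is emitted only once. The recorded frame indices give a rough position for each token in the audio, which is the kind of information a text-aligned tokenizer needs.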

Abstract

Text-speech joint spoken language modeling (SLM) aims at natural and intelligent speech-based interactions, but developing such a system may suffer from modality mismatch: speech unit sequences are much longer than text tokens. Prior work reduces this gap with text-aligned tokenization and embedding (TASTE), producing speech tokens that align in length with their textual counterparts. However, the dependence on an external ASR system and the use of a non-causal decoder limit streaming use. To address this limitation, we propose TASTE-S, a streamable extension of TASTE suitable for real-time usage. TASTE-S integrates a CTC-based ASR module into the encoder for instant dual-modality encoding. We also redesign the unit decoder to enable on-the-fly decoding. With joint training, we show that TASTE-S matches TASTE's performance while significantly reducing latency. Further investigations reveal that TASTE-S remains robust to transcriptions and enables long-form encoding and decoding.
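
The length mismatch described in the abstract can be made concrete with a small sketch that reduces a long sequence of frame-level features to one vector per text token. The abstract does not say how TASTE or TASTE-S aggregates frames, so the mean pooling below is a stand-in chosen for simplicity, and the function name and the `token_starts` input (the frame index where each token begins) are hypothetical.

```python
from typing import List, Sequence


def pool_frames_by_token(
    frames: Sequence[Sequence[float]],
    token_starts: Sequence[int],
) -> List[List[float]]:
    """Average frame features over each token's span (illustrative only).

    Token i covers frames token_starts[i] up to the next token's start,
    and the last token runs to the end of the audio. Frames before the
    first token start are ignored.
    """
    pooled: List[List[float]] = []
    for i, start in enumerate(token_starts):
        end = token_starts[i + 1] if i + 1 < len(token_starts) else len(frames)
        segment = frames[start:end]
        if not segment:
            raise ValueError("token_starts must be strictly increasing and in range")
        dim = len(segment[0])
        # Mean over the span, one feature dimension at a time.
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled


if __name__ == "__main__":
    frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
    # Five frames, two tokens: the output has exactly two vectors.
    print(pool_frames_by_token(frames, token_starts=[0, 2]))
```

The output length always equals the number of tokens, regardless of how many frames the audio contains, which is the property that lets speech embeddings be paired one-to-one with text tokens.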
