TASTE-Streaming: Towards Streamable Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
arXiv cs.CL / 3/16/2026
Key Points
- TASTE-S (TASTE-Streaming) is a streamable extension of TASTE for spoken language modeling. It reduces latency by integrating a CTC-based ASR module into the encoder, which enables instant dual-modality encoding.
- The approach also redesigns the unit decoder for on-the-fly decoding, which makes real-time streaming use possible.
- With joint training, TASTE-S matches TASTE's performance while significantly reducing latency and supporting long-form encoding and decoding.
- It remains robust to transcription quality: imperfect ASR output has limited impact, which improves its practical usability for streaming spoken language modeling.
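
To illustrate why a CTC-based ASR module suits streaming, here is a minimal sketch of generic greedy CTC decoding applied one frame at a time. It is not the TASTE-S implementation: the function name `streaming_ctc_greedy`, the blank index `0`, and the toy score vectors are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical vocabulary layout)


def streaming_ctc_greedy(
    frames: Iterable[Sequence[float]], blank: int = BLANK
) -> Iterator[int]:
    """Yield token ids as soon as they are decided, consuming one frame at a time."""
    prev = blank
    for scores in frames:
        # Greedy choice: the highest-scoring label for this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # CTC collapse rule: emit only when the label is non-blank
        # and differs from the previous frame's label.
        if best != blank and best != prev:
            yield best
        prev = best


# Toy per-frame scores over a 3-symbol vocabulary (index 0 is blank).
demo_frames = [
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # label 1
    [0.1, 0.8, 0.1],    # label 1 repeated, collapsed
    [0.9, 0.05, 0.05],  # blank separates the two 1s
    [0.1, 0.8, 0.1],    # label 1 again
    [0.1, 0.1, 0.8],    # label 2
    [0.1, 0.1, 0.8],    # label 2 repeated, collapsed
    [0.9, 0.05, 0.05],  # blank
]
decoded = list(streaming_ctc_greedy(demo_frames))
```

Because each emission depends only on the current and previous frame, tokens can be produced as audio arrives rather than after the full utterance, which is the property a streaming encoder relies on.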




