Streaming Translation and Transcription Through Speech-to-Text Causal Alignment

arXiv cs.CL / 3/13/2026

Key Points

  • The paper presents Hikari, a policy-free, end-to-end model for simultaneous speech-to-text translation and streaming transcription that encodes READ/WRITE decisions with a probabilistic WAIT token mechanism.
  • It introduces Decoder Time Dilation to reduce autoregressive overhead and balance training distribution, improving efficiency.
  • A supervised fine-tuning strategy trains the model to recover from delays, significantly improving the quality-latency trade-off.
  • Evaluated on English-to-Japanese, English-to-German, and English-to-Russian, Hikari achieves new state-of-the-art BLEU scores across both low- and high-latency regimes, outperforming recent baselines.

Abstract

Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, German, and Russian, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.
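To make the policy-free idea concrete, here is a minimal sketch of a streaming decode loop in which the model itself schedules READ/WRITE actions by emitting a special WAIT token, rather than relying on an external policy. All names here (`StubModel`, `WAIT`, `EOS`, the fixed read ratio) are illustrative assumptions, not the paper's actual architecture or token inventory.

```python
# Sketch: WAIT-token-driven simultaneous decoding (hypothetical names).
# The model's own output vocabulary includes a WAIT token; emitting it
# acts as a READ decision, while any other token is a WRITE decision.

WAIT = "<wait>"  # hypothetical token signalling "read more source"
EOS = "<eos>"    # hypothetical end-of-sequence token


class StubModel:
    """Toy stand-in for the translator: it emits WAIT until it has read
    `ratio` source chunks per target token, then writes one token."""

    def __init__(self, ratio=2):
        self.ratio = ratio  # source chunks consumed per token written

    def next_token(self, source_chunks, target_tokens):
        if len(source_chunks) < self.ratio * (len(target_tokens) + 1):
            if source_chunks and source_chunks[-1] is None:
                return EOS  # source stream is exhausted
            return WAIT     # READ decision
        return f"tok{len(target_tokens)}"  # WRITE decision


def simultaneous_decode(model, speech_stream):
    """Interleave READ (consume a chunk) and WRITE (emit a token),
    letting the model's WAIT token drive the schedule end-to-end."""
    source_chunks, target_tokens = [], []
    stream = iter(speech_stream)
    while True:
        tok = model.next_token(source_chunks, target_tokens)
        if tok == WAIT:
            # READ: pull the next speech chunk (None marks stream end).
            source_chunks.append(next(stream, None))
        elif tok == EOS:
            break
        else:
            # WRITE: commit a target token to the output immediately.
            target_tokens.append(tok)
    return target_tokens
```

In a real system the WAIT probability would come from the model's softmax at each step, so the quality-latency trade-off emerges from training rather than from a hand-tuned read schedule; a higher propensity to emit WAIT means more source context per written token, i.e. higher latency but typically better translation quality.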