Covertly improving intelligibility with data-driven adaptations of speech timing

arXiv cs.CL / 4/1/2026

💬 Opinion / Ideas & Deep Analysis / Models & Research

Key Points

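  • It has been unclear whether the global slowing of speech that talkers adopt for hard-of-hearing or second-language listeners actually improves intelligibility (word comprehension). This study uses the fine-grained control offered by generated speech to test whether local, timing-based adjustments are effective.
  • Reverse-correlation experiments showed that, in the context preceding a target vowel contrast (e.g., tense-lax), the temporal influence of speech rate follows a "scissor-like" pattern, with opposite effects in early versus late windows. The pattern is stable both within individual listeners and across native and non-native listeners.
  • This rate pattern not only helps L2 listeners comprehend the vowel contrast; native listeners also rely on it under challenging acoustic conditions.
  • The authors further built a data-driven text-to-speech (TTS) algorithm that reproduces this temporal structure. Targeted rate adjustments improved comprehension, yet participants tended not to notice the improvement; instead, they judged global slowing as clearer even though it increased comprehension errors.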

Abstract

Human talkers often address listeners with language-comprehension challenges, such as hard-of-hearing or non-native adults, by globally slowing down their speech. However, it remains unclear whether this strategy actually makes speech more intelligible. Here, we take advantage of recent advancements in machine-generated speech allowing more precise control of speech rate in order to systematically examine how targeted speech-rate adjustments may improve comprehension. We first use reverse-correlation experiments to show that the temporal influence of speech rate prior to a target vowel contrast (e.g., the tense-lax distinction) in fact manifests in a scissor-like pattern, with opposite effects in early versus late context windows; this pattern is remarkably stable both within individuals and across native L1-English listeners and L2-English listeners with French, Mandarin, and Japanese L1s. Second, we show that this speech-rate structure not only facilitates L2 listeners' comprehension of the target vowel contrast, but that native listeners also rely on this pattern in challenging acoustic conditions. Finally, we build a data-driven text-to-speech algorithm that replicates this temporal structure on novel speech sequences. Across a variety of sentences and vowel contrasts, listeners remained unaware that such targeted slowing improved word comprehension. Strikingly, participants instead judged the common strategy of global slowing as clearer, even though it actually increased comprehension errors. Together, these results show that targeted adjustments to speech rate significantly aid intelligibility under challenging conditions, while often going unnoticed. More generally, this paper provides a data-driven methodology to improve the accessibility of machine-generated speech that can be extended to other aspects of speech comprehension and a wide variety of listeners and environments.
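
For readers unfamiliar with reverse correlation, the following is a minimal sketch of the general technique on simulated data, not the authors' procedure or code. It assumes a hypothetical listener whose binary vowel judgment depends on a weighted sum of random speech-rate perturbations across four context windows; the weights in `true_kernel`, the noise level, the number of windows, and all function names are made up for illustration. The kernel is estimated as the mean perturbation profile on trials with one response minus the mean profile on trials with the other response.

```python
import random


def simulate_listener(profile, true_kernel, noise_sd, rng):
    """Hypothetical listener: returns True (e.g. a 'tense' response) when the
    weighted rate evidence plus internal Gaussian noise is positive."""
    evidence = sum(w * x for w, x in zip(true_kernel, profile))
    return evidence + rng.gauss(0.0, noise_sd) > 0.0


def reverse_correlation_kernel(profiles, responses):
    """Per-window mean of profiles on True trials minus mean on False trials."""
    yes = [p for p, r in zip(profiles, responses) if r]
    no = [p for p, r in zip(profiles, responses) if not r]
    if not yes or not no:
        raise ValueError("both response classes are needed to estimate a kernel")
    n_windows = len(profiles[0])
    return [
        sum(p[i] for p in yes) / len(yes) - sum(p[i] for p in no) / len(no)
        for i in range(n_windows)
    ]


def run_demo(n_trials=4000, seed=0):
    rng = random.Random(seed)
    # ASSUMPTION: an invented scissor-like weighting, with opposite signs in
    # early versus late context windows. The sign convention is arbitrary.
    true_kernel = [1.0, 0.5, -0.5, -1.0]
    # Each trial perturbs the speech rate of every window independently.
    profiles = [
        [rng.gauss(0.0, 1.0) for _ in true_kernel] for _ in range(n_trials)
    ]
    responses = [simulate_listener(p, true_kernel, 1.0, rng) for p in profiles]
    return reverse_correlation_kernel(profiles, responses)


kernel = run_demo()
print([round(k, 2) for k in kernel])
```

With enough trials, the estimated kernel recovers the sign flip between early and late windows of the simulated listener, which is the sense in which a reverse-correlation kernel can reveal a scissor-like temporal pattern. A real experiment would replace `simulate_listener` with participants' responses to rate-manipulated speech.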