Elderly-Contextual Data Augmentation via Speech Synthesis for Elderly ASR
arXiv cs.CL / 4/29/2026
Key Points
- The paper proposes an elderly-contextual data augmentation pipeline for elderly ASR (EASR) by combining LLM-based transcript paraphrasing with text-to-speech (TTS) synthesis using elderly reference speakers.
- Starting from an elderly speech dataset, the LLM generates elderly-contextual paraphrases, and the TTS model produces synthetic speech that is paired with those paraphrases to create new audio-text training examples.
- The synthetic and original data are merged to fine-tune Whisper without changing the model architecture, with the aim of addressing two obstacles in EASR: limited training data and the distinct characteristics of elderly speech.
- Experiments on English and Korean elderly datasets (70+ speakers) show consistent gains over conventional augmentation baselines, including up to a 58.2% WER reduction versus the Whisper baseline.
- The authors also study how augmentation ratio and the mix of reference speakers affect performance in low-resource EASR settings.
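
The data flow described in the key points (paraphrase a transcript, synthesize it with an elderly reference speaker, then merge with the original data) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Example`, `augment`, and the `paraphrase` and `synthesize` callables are hypothetical stand-ins for the LLM and TTS model, and the `ratio` parameter assumes the augmentation ratio means the number of synthetic examples relative to the number of original ones, which the summary does not specify.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    audio: str               # placeholder for an audio path or clip ID
    text: str                # transcript paired with the audio
    synthetic: bool = False  # True for TTS-generated examples


def augment(
    originals: List[Example],
    paraphrase: Callable[[str], str],       # hypothetical LLM wrapper
    synthesize: Callable[[str, str], str],  # hypothetical TTS wrapper: (text, speaker) -> audio
    reference_speakers: List[str],
    ratio: float,                           # assumed: synthetic count / original count
    seed: int = 0,
) -> List[Example]:
    """Return the originals plus round(ratio * len(originals)) synthetic pairs."""
    rng = random.Random(seed)
    n_synth = round(ratio * len(originals))
    synthetic = []
    for i in range(n_synth):
        # Cycle through the original transcripts as paraphrase sources.
        source = originals[i % len(originals)]
        new_text = paraphrase(source.text)
        # Pick an elderly reference speaker for the synthetic voice.
        speaker = rng.choice(reference_speakers)
        new_audio = synthesize(new_text, speaker)
        synthetic.append(Example(new_audio, new_text, synthetic=True))
    # The merged set is what would be passed to Whisper fine-tuning.
    return originals + synthetic


if __name__ == "__main__":
    data = [Example("a.wav", "take my pills"), Example("b.wav", "call my daughter")]
    merged = augment(
        data,
        paraphrase=lambda t: t + " please",      # toy stand-in for an LLM
        synthesize=lambda t, s: f"{s}:{t}.wav",  # toy stand-in for TTS
        reference_speakers=["elderly_ref_1", "elderly_ref_2"],
        ratio=1.0,
    )
    for ex in merged:
        print(ex)
```

Keeping the paraphraser and synthesizer as injected callables means the merge logic can be exercised without committing to any particular LLM or TTS system, and varying `ratio` or `reference_speakers` mirrors the two factors the authors report studying.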
