Assessing the Ability of Neural TTS Systems to Model Consonant-Induced F0 Perturbation

arXiv cs.CL / 3/24/2026

Key Points

  • The paper introduces a segmental-level prosodic probing framework to test how well neural TTS models reproduce consonant-induced F0 perturbations tied to local articulatory mechanisms.
  • Experiments compare synthetic and natural speech across thousands of words stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same LJ Speech corpus.
  • Findings indicate accurate reproduction for high-frequency words but weak generalization to low-frequency items, suggesting that the examined models rely more on lexical-level memorization than on abstract segmental-prosodic encoding.
  • The authors extend the evaluation to multiple advanced TTS systems and propose the probe as a linguistically grounded diagnostic that may inform TTS evaluation, interpretability, and authenticity assessment of synthetic speech.

Abstract

This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models' ability to reproduce consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems' ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.
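
To make the idea of the probe concrete, here is a minimal Python sketch of how a consonant-induced F0 perturbation could be summarized per speech source and lexical-frequency stratum. It is not the authors' pipeline: the abstract does not specify the measure, so the sketch assumes a common operationalization (mean vowel-onset F0 after voiceless consonants minus mean vowel-onset F0 after voiced consonants, in semitones). The function names, the token fields (`source`, `freq_bin`, `voicing`, `onset_f0_hz`), the 100 Hz reference, and the toy numbers are all hypothetical placeholders, not data from the paper.

```python
import math
from collections import defaultdict
from statistics import mean


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert F0 in Hz to semitones relative to ref_hz (assumed reference)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def perturbation_by_group(tokens):
    """Return {(source, freq_bin): voiceless-minus-voiced onset F0 in semitones}.

    Each token is a dict with hypothetical keys:
      source       "natural" or "synthetic"
      freq_bin     lexical-frequency stratum, e.g. "high" or "low"
      voicing      "voiceless" or "voiced" (the onset consonant)
      onset_f0_hz  F0 measured at the onset of the following vowel
    Groups that lack either voicing category are skipped.
    """
    buckets = defaultdict(lambda: {"voiceless": [], "voiced": []})
    for t in tokens:
        key = (t["source"], t["freq_bin"])
        buckets[key][t["voicing"]].append(hz_to_semitones(t["onset_f0_hz"]))

    result = {}
    for key, groups in buckets.items():
        if groups["voiceless"] and groups["voiced"]:
            result[key] = mean(groups["voiceless"]) - mean(groups["voiced"])
    return result


# Invented placeholder values for illustration only; NOT measurements
# from the paper or from any corpus.
toy_tokens = [
    {"source": "natural", "freq_bin": "high", "voicing": "voiceless", "onset_f0_hz": 210.0},
    {"source": "natural", "freq_bin": "high", "voicing": "voiced", "onset_f0_hz": 200.0},
    {"source": "synthetic", "freq_bin": "low", "voicing": "voiceless", "onset_f0_hz": 200.0},
    {"source": "synthetic", "freq_bin": "low", "voicing": "voiced", "onset_f0_hz": 200.0},
    # Only one voicing category here, so this group is omitted from the result.
    {"source": "synthetic", "freq_bin": "high", "voicing": "voiced", "onset_f0_hz": 195.0},
]

for (source, freq_bin), diff in sorted(perturbation_by_group(toy_tokens).items()):
    print(f"{source:9s} {freq_bin:4s} perturbation = {diff:.2f} st")
```

Under these assumptions, a system that has learned the segmental effect would show a perturbation in its synthetic output comparable to the natural one in every frequency stratum, whereas a shrinking difference in the low-frequency stratum would be consistent with the memorization account described above. Working in semitones rather than Hz keeps the comparison less sensitive to differences in overall pitch range.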