Lexical Tone is Hard to Quantize: Probing Discrete Speech Units in Mandarin and Yorùbá

arXiv cs.CL / 4/10/2026


Key Points

  • The paper studies discrete speech units (DSUs) created by quantizing self-supervised learning (SSL) representations and finds that they encode suprasegmental features (like prosody) less reliably than segmental/phonetic structure.
  • Experiments on Mandarin and Yorùbá suggest SSL latent spaces do contain tone information, but common DSU quantization methods (including K-means and alternatives) tend to prioritize phonetic structure, weakening lexical tone representation.
  • The authors conclude that current DSU quantization strategies have systematic limitations for suprasegmental features, implying broader issues for other prosody-related attributes.
  • They propose a potential improvement: cluster once to capture phonetic information and then cluster again on the residual representation to better encode lexical tone.
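The two-stage clustering idea can be illustrated with a minimal sketch. This is not the authors' implementation: the feature matrix is synthetic, and the cluster counts (`n_phonetic_units`, `n_tone_units`) are arbitrary placeholders. The idea is that after a first K-means pass captures the dominant (phonetic) structure, the per-frame residuals may still carry suprasegmental variation such as tone, which a second K-means pass can then quantize.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical SSL frame representations: (n_frames, feature_dim).
# In practice these would come from a model such as HuBERT or wav2vec 2.0.
rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 64))

n_phonetic_units = 8  # illustrative; real DSU vocabularies are far larger
n_tone_units = 4      # illustrative

# Stage 1: standard DSU extraction -- K-means over the SSL features.
km_phonetic = KMeans(n_clusters=n_phonetic_units, n_init=10,
                     random_state=0).fit(features)
phonetic_units = km_phonetic.predict(features)

# Stage 2: subtract each frame's assigned centroid and cluster the
# residual, which may better retain tone-related variation.
residuals = features - km_phonetic.cluster_centers_[phonetic_units]
km_tone = KMeans(n_clusters=n_tone_units, n_init=10,
                 random_state=0).fit(residuals)
tone_units = km_tone.predict(residuals)

# Each frame is now described by a (phonetic, residual) unit pair.
dsu_pairs = list(zip(phonetic_units, tone_units))
```

On real SSL features one would expect the stage-1 centroids to absorb mostly segmental structure, leaving the residual space as a candidate carrier for tone; here the synthetic data only demonstrates the mechanics of the pipeline.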

Abstract

Discrete speech units (DSUs) are derived by quantising representations from models trained using self-supervised learning (SSL). They are a popular representation for a wide variety of spoken language tasks, including those where prosody matters. DSUs are especially convenient for tasks where text and speech are jointly modelled, such as text-to-speech and multimodal dialogue systems. But we have found that DSUs encode suprasegmental information less reliably than segmental structure, which we demonstrate in this work using lexical tone, though this limitation likely extends to other suprasegmental features such as prosody. Our investigations using the tone languages Mandarin and Yorùbá show that the SSL latent representations themselves do encode tone, yet DSUs obtained using quantisation tend to prioritise phonetic structure, which makes lexical tone less reliably encoded. This remains true for a variety of quantisation methods, not only the most common, K-means. We conclude that current DSU quantisation strategies have limitations for suprasegmental features, which suggests a need for new, tone-aware (or prosody-aware) techniques in speech representation learning. We point towards a potential form of the solution by performing K-means clustering once to encode phonetic information, then again on the residual representation, which better encodes lexical tone.