Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean

arXiv cs.CL / 4/22/2026

Key Points

  • The paper addresses the challenge of mapping continuous F0 pitch contours in real speech to discrete tonal categories in Seoul Korean under the Autosegmental-Metrical (AM) framework.
  • It introduces Dual-Glob, a deep supervised contrastive learning method that aligns clean and augmented F0 views in a shared latent space to preserve holistic contour structure.
  • The study builds a new large-scale benchmark dataset with 10,093 manually annotated Accentual Phrases in Seoul Korean to support fine-grained pitch accent classification.
  • Experimental results indicate Dual-Glob achieves state-of-the-art performance, reaching 77.75% accuracy and an F1-score of 51.54%, outperforming strong baseline models.
  • Overall, the work demonstrates that deep contrastive learning can capture robust, holistic structural features of continuous F0 contours while supporting AM-based intonational phonology.
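
The accuracy and F1 figures quoted above differ by more than 25 points. The summary does not say how the F1-score is averaged, so the sketch below only illustrates one common situation in which such a gap can arise: a macro-averaged F1 on imbalanced classes. The tone-pattern labels and the toy predictions are hypothetical and are not taken from the paper's dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label equals the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging; not stated in the source)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, imbalanced toy data: 8 phrases of one pattern, 2 of another.
y_true = ["LHLH"] * 8 + ["HHLH"] * 2
# A classifier that always predicts the majority pattern.
y_pred = ["LHLH"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} macro_f1={f1:.4f}")
```

In this toy case the majority-only classifier is right on 8 of 10 items, yet the minority class contributes an F1 of zero, which pulls the macro average well below the accuracy.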

Abstract

The intonational structure of Seoul Korean has been defined in terms of discrete tonal categories within the Autosegmental-Metrical (AM) model of intonational phonology. However, mapping continuous F0 contours to these invariant categories is challenging because F0 realizations vary in real-world speech. This paper proposes Dual-Glob, a deep supervised contrastive learning framework for robustly classifying fine-grained pitch accent patterns in Seoul Korean. Unlike conventional local predictive models, our approach captures holistic F0 contour shapes by enforcing structural consistency between clean and augmented views in a shared latent space. To this end, we introduce the first large-scale benchmark dataset, consisting of 10,093 manually annotated Accentual Phrases in Seoul Korean. Experimental results show that Dual-Glob significantly outperforms strong baseline models, achieving state-of-the-art accuracy (77.75%) and F1-score (51.54%). Our work thus supports AM-based intonational phonology with a data-driven methodology, showing that deep contrastive learning effectively captures holistic structural features of continuous F0 contours.
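
The idea of enforcing consistency between clean and augmented views with label supervision can be illustrated with a generic supervised contrastive loss. The sketch below is a minimal pure-Python version of that general technique, not the Dual-Glob objective itself, whose exact form is not given in this summary. The function names, the temperature value, the two-dimensional embeddings, and the tone-pattern labels are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss.

    For each anchor, every other item with the same label is a positive and
    all remaining items are negatives. The loss is averaged over anchors that
    have at least one positive.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        sims = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n)]
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue
        # Numerically stable log-sum-exp over all non-anchor items.
        m = max(sims[j] for j in others)
        log_denom = m + math.log(sum(math.exp(sims[j] - m) for j in others))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical embeddings of two phrases: a clean view and an augmented view each.
clean = [[1.0, 0.0], [0.0, 1.0]]
augmented = [[0.9, 0.1], [0.1, 0.9]]
labels = ["LHLH", "HHLH"]  # illustrative labels only

# Views of the same phrase lie close together: low loss.
aligned = supcon_loss(clean + augmented, labels + labels)
# Augmented views swapped, so same-label items lie far apart: high loss.
swapped = supcon_loss(clean + augmented[::-1], labels + labels)
print(f"aligned={aligned:.4f} swapped={swapped:.4f}")
```

Minimizing a loss of this kind pulls the clean and augmented embeddings of same-label contours together and pushes different-label contours apart, which is the behavior the abstract describes as structural consistency in a shared latent space.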