ProKWS: Personalized Keyword Spotting via Collaborative Learning of Phonemes and Prosody

arXiv cs.AI / 3/20/2026

Key Points

  • ProKWS introduces a dual-stream encoder that jointly learns phonemic representations and speaker-specific prosodic patterns, with a collaborative fusion module that combines the two streams.
  • The phoneme stream uses contrastive learning to strengthen phonemic representations, while the prosody stream captures speaker-specific characteristics such as tone, stress, and rhythm.
  • The approach aims to improve adaptability across different acoustic environments and personalize keyword spotting for tone and intent variations.
  • Experiments indicate performance competitive with state-of-the-art models on standard benchmarks, along with robust handling of personalized keywords across diverse prosodic expressions.
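
To make the contrastive-learning idea in the phoneme stream concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. This summary does not say which contrastive objective ProKWS uses, so the loss form, the cosine similarity, the `temperature` value, and the function names `cosine` and `info_nce` are all illustrative assumptions rather than the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed objective, not taken from the paper).

    anchor    : embedding of a spoken keyword
    positive  : embedding that should match the anchor (same phoneme content)
    negatives : embeddings of confusable keywords that should not match
    Returns -log softmax probability assigned to the positive.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Hypothetical 2-D embeddings, purely for illustration.
anchor = [1.0, 0.0]
matching = [0.9, 0.1]
confusable = [[0.0, 1.0], [-1.0, 0.0]]

loss_aligned = info_nce(anchor, matching, confusable)
loss_misaligned = info_nce(anchor, [0.0, 1.0], [matching, [-1.0, 0.0]])
```

In this toy setup the loss is lower when the positive embedding lies close to the anchor than when a confusable embedding is treated as the positive, which is the pressure a contrastive objective puts on the encoder.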

Abstract

Current keyword spotting systems primarily use phoneme-level matching to distinguish confusable words but ignore user-specific pronunciation traits like prosody (intonation, stress, rhythm). This paper presents ProKWS, a novel framework integrating fine-grained phoneme learning with personalized prosody modeling. We design a dual-stream encoder where one stream derives robust phonemic representations through contrastive learning, while the other extracts speaker-specific prosodic patterns. A collaborative fusion module dynamically combines phonemic and prosodic information, enhancing adaptability across acoustic environments. Experiments show that ProKWS delivers highly competitive performance, comparable to state-of-the-art models on standard benchmarks, and demonstrates strong robustness for personalized keywords with tone and intent variations.
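
One common way to combine two feature streams dynamically is a learned scalar gate, sketched below in plain Python. The abstract does not describe how the collaborative fusion module is built, so the gating form, the function name `gated_fusion`, and the parameters `gate_weights` and `gate_bias` are hypothetical stand-ins, not the ProKWS design.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(phoneme_vec, prosody_vec, gate_weights, gate_bias=0.0):
    """Blend two same-sized embeddings with an input-dependent gate.

    This is an assumed fusion scheme for illustration only.
    gate  = sigmoid(w . [phoneme; prosody] + b)
    fused = gate * phoneme + (1 - gate) * prosody
    """
    if len(phoneme_vec) != len(prosody_vec):
        raise ValueError("both streams must have the same dimension")
    concat = list(phoneme_vec) + list(prosody_vec)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    score = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    gate = sigmoid(score)
    fused = [gate * p + (1.0 - gate) * r
             for p, r in zip(phoneme_vec, prosody_vec)]
    return fused, gate


# Hypothetical 2-D stream outputs.
phoneme = [1.0, 0.0]
prosody = [0.0, 1.0]

# Zero weights and zero bias give a gate of 0.5, i.e. a plain average.
balanced, balanced_gate = gated_fusion(phoneme, prosody, [0.0] * 4)

# A large positive bias pushes the gate toward the phoneme stream.
phoneme_heavy, heavy_gate = gated_fusion(phoneme, prosody, [0.0] * 4, gate_bias=50.0)
```

Because the gate depends on both inputs, a trained version of this scheme could lean on phonemic evidence in some conditions and on prosodic evidence in others, which is one reading of "dynamically combines" in the abstract.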