AI Navigate

PCOV-KWS: Multi-task Learning for Personalized Customizable Open Vocabulary Keyword Spotting

arXiv cs.AI / 3/20/2026

💬 Opinion | Models & Research

Key Points

  • The PCOV-KWS paper introduces a multi-task learning framework for personalized, customizable open-vocabulary keyword spotting (KWS), motivated by the growing demand for privacy and personalization as advances in IoT, ASR, SV, and TTS make voice assistants more widely used.
  • It uses a lightweight network to jointly perform Keyword Spotting and Speaker Verification, and replaces softmax-based loss with a training criterion that turns multi-class problems into multiple binary classifications to avoid inter-category competition.
  • An optimization strategy sets the multi-task loss weights during training. Evaluated on multiple datasets, the approach outperforms the baselines while using fewer parameters and less computation.
  • The work supports privacy-friendly, customized voice experiences and could enable more efficient on-device personalized KWS for consumer devices.
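
The second key point contrasts a softmax-based loss with a criterion built from multiple binary classifications. The summary does not give the paper's exact formulation, so the following is only a minimal sketch of the general idea, assuming a plain one-vs-rest sigmoid loss. The function names (softmax_cross_entropy, binary_terms, binary_decomposed_loss) and the example logits are hypothetical and are not taken from the paper.

```python
import math


def softmax_cross_entropy(logits, target):
    """Softmax loss: every class shares one normalizer, so classes compete."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[target]


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def binary_terms(logits, target):
    """Per-class binary cross-entropy terms (assumed one-vs-rest form).

    -log(sigmoid(z))     = softplus(-z) for the target class,
    -log(1 - sigmoid(z)) = softplus(z)  for every other class.
    """
    return [_softplus(-z) if k == target else _softplus(z)
            for k, z in enumerate(logits)]


def binary_decomposed_loss(logits, target):
    """Sum of independent binary losses, one per class."""
    return sum(binary_terms(logits, target))


# Hypothetical logits for three keywords; class 0 is the true keyword.
before = [2.0, -1.0, 0.5]
after = [2.0, 3.0, 0.5]  # only the logit of class 1 changed

# The binary term of the target class ignores the change in class 1 ...
same_target_term = binary_terms(before, 0)[0] == binary_terms(after, 0)[0]
# ... whereas the softmax loss on the target rises through the normalizer.
softmax_rises = softmax_cross_entropy(after, 0) > softmax_cross_entropy(before, 0)
```

In this sketch each class term depends only on its own logit, which is one way to read "avoiding inter-category competition". Under softmax, raising any other class logit lowers the target probability through the shared normalizer.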

Abstract

As advancements in technologies like Internet of Things (IoT), Automatic Speech Recognition (ASR), Speaker Verification (SV), and Text-to-Speech (TTS) lead to increased usage of intelligent voice assistants, the demand for privacy and personalization has escalated. In this paper, we introduce a multi-task learning framework for personalized, customizable open-vocabulary Keyword Spotting (PCOV-KWS). This framework employs a lightweight network to simultaneously perform Keyword Spotting (KWS) and SV to address personalized KWS requirements. We integrate a training criterion distinct from softmax-based loss that transforms multi-class classification into multiple binary classifications, eliminating inter-category competition. An optimization strategy for multi-task loss weighting is employed during training. We evaluated our PCOV-KWS system on multiple datasets, demonstrating that it outperforms the baselines while also requiring fewer parameters and lower computational resources.
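
The abstract mentions an optimization strategy for multi-task loss weighting but does not say which one. As an illustration only, the sketch below uses uncertainty-based weighting, a common choice in multi-task learning, in the simplified form total = sum_i exp(-s_i) * L_i + s_i with one learnable scalar s_i per task. This is an assumed stand-in, not the paper's confirmed method. The function names and the loss values for the KWS and SV heads are hypothetical.

```python
import math


def weighted_total(losses, log_vars):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


def update_log_vars(losses, log_vars, lr=0.1):
    """One gradient-descent step on each s_i, with task losses held fixed.

    d/ds [exp(-s) * L + s] = 1 - exp(-s) * L
    """
    return [s - lr * (1.0 - math.exp(-s) * loss)
            for loss, s in zip(losses, log_vars)]


# Hypothetical loss values for the KWS and SV heads at one training step.
losses = [2.0, 0.5]
log_vars = [0.0, 0.0]  # start with equal weights exp(-0) = 1

for _ in range(500):
    log_vars = update_log_vars(losses, log_vars)

# Effective per-task weights after the s_i have settled.
weights = [math.exp(-s) for s in log_vars]
```

With the losses held fixed, each s_i settles at log(L_i), so the effective weight approaches 1 / L_i and the larger loss is scaled down. In real training the s_i would be updated jointly with the network parameters while the task losses themselves change.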