Selective Augmentation: Improving Universal Automatic Phonetic Transcription via G2P Bootstrapping

arXiv cs.CL / 5/1/2026

Key Points

  • The paper proposes “Selective Augmentation,” a bootstrapping method to improve universal automatic phonetic transcription (APT) when high-quality, diverse training transcriptions are scarce.
  • Using a MultIPA-based setup, the authors selectively transfer phonetic distinctions from a helper language (Hindi) to augment an existing training dataset for a target language (German).
  • The method improves plosive voicing accuracy by reducing false positives, yielding a reported 17.6% gain.
  • It also introduces a new capability, plosive aspiration recognition: the share of German /p, t, k/ transcribed as aspirated rises from 0% to 61.2%.
  • The paper describes intrinsic evaluation challenges and develops objective metrics; introducing aspiration recognition also reduced the tenuis class by 32.2%, lowering conflations among the test language's plosives.
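
The core idea of transferring a distinction from a helper language into target-language training transcriptions can be sketched as a rewrite rule. Everything below is a hypothetical illustration, not the paper's actual procedure: the rule (word-initial German /p, t, k/ become aspirated, as in the canonical phonetics of German) and the function name are assumptions for the sake of the example.

```python
# Toy sketch of selectively augmenting a target-language transcription with a
# phonetic distinction (aspiration) that the helper language's data makes
# explicit. The rule below is a simplification chosen for illustration.

# Hypothetical rule: word-initial tenuis plosives are realised as aspirated.
ASPIRATION_MAP = {"p": "pʰ", "t": "tʰ", "k": "kʰ"}

def augment_transcription(ipa_tokens):
    """Rewrite a word-initial tenuis plosive as its aspirated variant."""
    if not ipa_tokens:
        return ipa_tokens
    out = list(ipa_tokens)
    out[0] = ASPIRATION_MAP.get(out[0], out[0])
    return out

# Example: German "Tisch" /tɪʃ/ -> [tʰ ɪ ʃ]
print(augment_transcription(["t", "ɪ", "ʃ"]))
```

Applied over a training corpus, such rewrites yield transcriptions that carry the new feature, which the APT model can then learn to predict.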

Abstract

In the field of universal automatic phonetic transcription (APT), clean and diverse training transcriptions are required, but such high-quality data is limited. We propose Selective Augmentation, a bootstrapping approach that improves the available training transcriptions by selectively transferring distinctions between languages. Based on the MultIPA model, we show by example that we can increase the accuracy of an existing feature (plosive voicing) and add a new feature (plosive aspiration) by augmenting the existing training data with information from a separate helper language (Hindi). We describe intrinsic challenges of the evaluation and develop objective metrics to measure success: voicing accuracy increased by 17.6% through a reduction in false positives. Additionally, aspiration recognition was introduced: while the baseline transcribed 0% of German /p, t, k/ as aspirated, our approach transcribed them as aspirated in 61.2% of cases. Introducing aspiration recognition to APT models allowed the tenuis class to be reduced by 32.2%, which also reduces conflations between the test language's plosives.
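
The headline aspiration number is, in essence, a rate over gold plosives. A minimal sketch of such a metric, assuming aligned (gold, predicted) phone pairs are available (the paper's actual alignment and evaluation protocol is not reproduced here, and the function name is illustrative):

```python
# Toy evaluation sketch: fraction of gold tenuis plosives that the model
# transcribed as aspirated, computed over pre-aligned phone pairs.
def aspiration_rate(aligned_pairs, plosives=("p", "t", "k")):
    """Share of gold /p, t, k/ whose prediction carries the ʰ diacritic."""
    relevant = [(g, p) for g, p in aligned_pairs if g in plosives]
    if not relevant:
        return 0.0
    aspirated = sum(1 for _, p in relevant if p.endswith("ʰ"))
    return aspirated / len(relevant)

# 2 of the 3 gold plosives below are predicted as aspirated.
pairs = [("t", "tʰ"), ("p", "p"), ("k", "kʰ"), ("a", "a")]
print(aspiration_rate(pairs))
```

A baseline that never emits the ʰ diacritic scores 0.0 on this metric, matching the 0% baseline figure reported above; analogous counts over the tenuis class would quantify its reduction.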