Advancing LLM-based phoneme-to-grapheme for multilingual speech recognition

arXiv cs.CL / 4/1/2026


Key Points

  • The paper studies an LLM-based phoneme-to-grapheme (P2G) approach for multilingual automatic speech recognition by factorizing ASR into speech-to-phoneme (S2P) and P2G modules.
  • It argues that multilingual P2G is difficult because language-aware text generation and cross-language data imbalance can degrade performance even when S2P is shared.
  • Using the CV-Lang10 benchmark (ten languages), the authors evaluate robustness strategies designed to handle uncertainty in the S2P outputs, including DANP and a simplified SKM variant (S-SKM).
  • S-SKM is presented as a Monte Carlo approximation that eliminates CTC-based S2P probability weighting during P2G training to improve training stability and effectiveness.
  • With robust training plus low-resource oversampling, the reported average WER improves from 10.56% to 7.66%, indicating a practical path to gains for multilingual LLM-based P2G.

Abstract

Phoneme-based ASR factorizes recognition into speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G), enabling cross-lingual acoustic sharing while keeping language-specific orthography in a separate module. While large language models (LLMs) are promising for P2G, multilingual P2G remains challenging due to language-aware generation and severe cross-language data imbalance. We study multilingual LLM-based P2G on the ten-language CV-Lang10 benchmark. We examine robustness strategies that account for S2P uncertainty, including DANP and Simplified SKM (S-SKM). S-SKM is a Monte Carlo approximation that avoids CTC-based S2P probability weighting in P2G training. Robust training and low-resource oversampling reduce the average WER from 10.56% to 7.66%.
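The S-SKM idea described in the abstract, training P2G on sampled phoneme hypotheses rather than CTC-probability-weighted ones, can be sketched roughly as follows. This is an illustrative Monte Carlo sketch only: the function names, toy posteriors, and blank symbol are assumptions for demonstration, not the paper's implementation.

```python
import random

BLANK = "<blk>"  # CTC blank symbol (name is an assumption)

def ctc_collapse(frames):
    """Collapse a framewise CTC path: merge consecutive repeats, drop blanks."""
    out, prev = [], None
    for p in frames:
        if p != prev and p != BLANK:
            out.append(p)
        prev = p
    return out

def sample_phoneme_seq(posteriors, rng):
    """Draw one framewise path from per-frame phoneme posteriors, then
    collapse it into a phoneme sequence (one Monte Carlo sample)."""
    path = []
    for dist in posteriors:  # dist maps phoneme -> probability for one frame
        symbols, probs = zip(*dist.items())
        path.append(rng.choices(symbols, weights=probs, k=1)[0])
    return ctc_collapse(path)

# Toy per-frame posteriors for 4 frames over {a, b, blank} (made-up values).
posteriors = [
    {"a": 0.7, "b": 0.1, BLANK: 0.2},
    {"a": 0.6, "b": 0.1, BLANK: 0.3},
    {BLANK: 0.8, "a": 0.1, "b": 0.1},
    {"b": 0.7, "a": 0.1, BLANK: 0.2},
]

rng = random.Random(0)
samples = [sample_phoneme_seq(posteriors, rng) for _ in range(5)]
# Under S-SKM-style training, the P2G model would then be trained on such
# sampled phoneme sequences directly, with uniform weight, instead of
# weighting hypotheses by their CTC path probabilities.
```

The point of the sketch is the training-objective change: sampling makes each phoneme hypothesis an unweighted training example, which sidesteps computing and propagating CTC path probabilities during P2G training.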
