Advancing LLM-based phoneme-to-grapheme for multilingual speech recognition

arXiv cs.CL / 4/1/2026


Key Points

  • The paper studies an LLM-based phoneme-to-grapheme (P2G) approach for multilingual automatic speech recognition by factorizing ASR into speech-to-phoneme (S2P) and P2G modules.
  • It argues that multilingual P2G is difficult because language-aware text generation and cross-language data imbalance can degrade performance even when S2P is shared.
  • Using the CV-Lang10 benchmark (ten languages), the authors evaluate robustness strategies designed to handle uncertainty in the S2P outputs, including DANP and a simplified SKM variant (S-SKM).
  • S-SKM is presented as a Monte Carlo approximation that eliminates CTC-based S2P probability weighting during P2G training to improve training stability and effectiveness.
  • With robust training plus low-resource oversampling, the reported average WER improves from 10.56% to 7.66%, indicating a practical path to gains for multilingual LLM-based P2G.

Abstract

Phoneme-based ASR factorizes recognition into speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G), enabling cross-lingual acoustic sharing while keeping language-specific orthography in a separate module. While large language models (LLMs) are promising for P2G, multilingual P2G remains challenging due to language-aware generation and severe cross-language data imbalance. We study multilingual LLM-based P2G on the ten-language CV-Lang10 benchmark. We examine robustness strategies that account for S2P uncertainty, including DANP and Simplified SKM (S-SKM). S-SKM is a Monte Carlo approximation that avoids CTC-based S2P probability weighting in P2G training. Robust training and low-resource oversampling reduce the average WER from 10.56% to 7.66%.
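The S-SKM idea described in the abstract, training P2G on sampled phoneme hypotheses rather than CTC-probability-weighted ones, can be sketched roughly as follows. This is an illustrative Monte Carlo sketch only: the function names, toy posteriors, and blank symbol are assumptions for demonstration, not the paper's implementation.

```python
import random

BLANK = "<blk>"  # CTC blank symbol (name is an assumption)

def ctc_collapse(frames):
    """Collapse a framewise CTC path: merge consecutive repeats, drop blanks."""
    out, prev = [], None
    for p in frames:
        if p != prev and p != BLANK:
            out.append(p)
        prev = p
    return out

def sample_phoneme_seq(posteriors, rng):
    """Draw one framewise path from per-frame phoneme posteriors, then
    collapse it into a phoneme sequence (one Monte Carlo sample)."""
    path = []
    for dist in posteriors:  # dist maps phoneme -> probability for one frame
        symbols, probs = zip(*dist.items())
        path.append(rng.choices(symbols, weights=probs, k=1)[0])
    return ctc_collapse(path)

# Toy per-frame posteriors for 4 frames over {a, b, blank} (made-up values).
posteriors = [
    {"a": 0.7, "b": 0.1, BLANK: 0.2},
    {"a": 0.6, "b": 0.1, BLANK: 0.3},
    {BLANK: 0.8, "a": 0.1, "b": 0.1},
    {"b": 0.7, "a": 0.1, BLANK: 0.2},
]

rng = random.Random(0)
samples = [sample_phoneme_seq(posteriors, rng) for _ in range(5)]
# Under S-SKM-style training, the P2G model would then be trained on such
# sampled phoneme sequences directly, with uniform weight, instead of
# weighting hypotheses by their CTC path probabilities.
```

The point of the sketch is the training-objective change: sampling makes each phoneme hypothesis an unweighted training example, which sidesteps computing and propagating CTC path probabilities during P2G training.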
