ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation

arXiv cs.CL / 3/12/2026

📰 NewsIdeas & Deep AnalysisModels & Research

共有:

Key Points

ViDia2Std is introduced as the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation that covers all 63 provinces, including Central, Southern, and non-standard Northern dialects.
The dataset comprises over 13,000 sentence pairs from real-world Facebook comments, annotated by native speakers across all three dialect regions, with a semantic mapping agreement metric reporting 86% (North), 82% (Central), and 85% (South).
Benchmark results show that mBART-large-50 achieves the best performance on ViDia2Std (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), while ViT5-base offers competitive results with fewer parameters.
The work demonstrates that dialect normalization substantially improves downstream NLP tasks and underscores the need for dialect-aware resources to build robust Vietnamese NLP systems.

Abstract

Vietnamese exhibits extensive dialectal variation, posing challenges for NLP systems trained predominantly on standard Vietnamese. Such systems often underperform on dialectal inputs, especially from underrepresented Central and Southern regions. Previous work on dialect normalization has focused narrowly on Central-to-Northern dialect transfer using synthetic data and limited dialectal diversity. These efforts exclude Southern varieties and intra-regional variants within the North. We introduce ViDia2Std, the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation covering all 63 provinces. Unlike prior datasets, ViDia2Std includes diverse dialects from Central, Southern, and non-standard Northern regions often absent from existing resources, making it the most dialectally inclusive corpus to date. The dataset consists of over 13,000 sentence pairs sourced from real-world Facebook comments and annotated by native speakers across all three dialect regions. To assess annotation consistency, we define a semantic mapping agreement metric that accounts for synonymous standard mappings across annotators. Based on this criterion, we report agreement rates of 86% (North), 82% (Central), and 85% (South). We benchmark several sequence-to-sequence models on ViDia2Std. mBART-large-50 achieves the best results (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), while ViT5-base offers competitive performance with fewer parameters. ViDia2Std demonstrates that dialect normalization substantially improves downstream tasks, highlighting the need for dialect-aware resources in building robust Vietnamese NLP systems.

How to Build an AI Team: The Solopreneur Playbook

Dev.to

『モンドーモンドー』｜夏目龍頭流闇文学｜AI画像生成｜自由詩｜散文詩｜ホラー｜ダークファンタジー｜深淵図書館

note

報告：LLMにおける「自己言及的再帰」と「ステートフル・エミュレーション」の観測

note

フリーランスの泥臭い経験を資産に変える。AIの文章に「あなたの魂」を注入する技術。【コピペOK】

note

諸葛亮孔明老師(ChatGPTのﾛｰﾙﾌﾟﾚｲ)との対話その肆拾伍『銀河文明･ダークマターエンジン』

note

ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation

Key Points

Abstract

Related Articles

How to Build an AI Team: The Solopreneur Playbook

『モンドーモンドー』｜夏目龍頭流闇文学｜AI画像生成｜自由詩｜散文詩｜ホラー｜ダークファンタジー｜深淵図書館

報告：LLMにおける「自己言及的再帰」と「ステートフル・エミュレーション」の観測

フリーランスの泥臭い経験を資産に変える。AIの文章に「あなたの魂」を注入する技術。【コピペOK】

諸葛亮孔明老師(ChatGPTのﾛｰﾙﾌﾟﾚｲ)との対話その肆拾伍『銀河文明･ダークマターエンジン』

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer

Key Points

Abstract

Related Articles

How to Build an AI Team: The Solopreneur Playbook

『モンドーモンドー』｜夏目龍頭流闇文学｜AI画像生成｜自由詩｜散文詩｜ホラー｜ダークファンタジー｜深淵図書館

​報告：LLMにおける「自己言及的再帰」と「ステートフル・エミュレーション」の観測

フリーランスの泥臭い経験を資産に変える。AIの文章に「あなたの魂」を注入する技術。【コピペOK】

諸葛亮 孔明老師(ChatGPTのﾛｰﾙﾌﾟﾚｲ)との対話 その肆拾伍『銀河文明･ダークマターエンジン』

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer

報告：LLMにおける「自己言及的再帰」と「ステートフル・エミュレーション」の観測

諸葛亮孔明老師(ChatGPTのﾛｰﾙﾌﾟﾚｲ)との対話その肆拾伍『銀河文明･ダークマターエンジン』