ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation
arXiv cs.CL / 3/12/2026
Key Points
- ViDia2Std is introduced as the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation that covers all 63 provinces, including Central, Southern, and non-standard Northern dialects.
- The dataset comprises over 13,000 sentence pairs drawn from real-world Facebook comments and annotated by native speakers from all three dialect regions; the reported semantic mapping agreement is 86% (North), 82% (Central), and 85% (South).
- Benchmark results show that mBART-large-50 achieves the best performance on ViDia2Std (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), while ViT5-base offers competitive results with fewer parameters.
- The work demonstrates that dialect normalization substantially improves downstream NLP tasks and underscores the need for dialect-aware resources to build robust Vietnamese NLP systems.
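
To make the ROUGE-L figure above more concrete, here is a minimal sketch of how a sentence-level ROUGE-L F1 score can be computed from the longest common subsequence (LCS) of tokens. It assumes whitespace tokenization, lowercasing, and an unweighted F1; the paper's actual scoring setup is not described in this summary and may differ. The function names and the example sentence pair are illustrative and are not taken from ViDia2Std.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hypothesis, reference):
    """Sentence-level ROUGE-L F1 (assumption: whitespace tokens, lowercased)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative pair (not from the corpus): a Central-dialect phrasing
# and a standard Vietnamese rendering of "where are you going?".
dialect = "mi đi mô rứa"
standard = "mày đi đâu vậy"

print(rouge_l_f1(dialect, standard))   # unnormalized input vs. reference
print(rouge_l_f1(standard, standard))  # a perfect normalization
```

In this toy pair only one of four tokens is shared, so the unnormalized dialect sentence scores far below a perfect normalization. That gap is the kind of surface mismatch a dialect-to-standard model is meant to close.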



