ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation
arXiv cs.CL / 3/12/2026
Key Points
- ViDia2Std is introduced as the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation. It covers all 63 provinces and spans Central, Southern, and non-standard Northern dialects.
- The dataset comprises over 13,000 sentence pairs drawn from real-world Facebook comments and annotated by native speakers from all three dialect regions. Reported semantic mapping agreement is 86% (North), 82% (Central), and 85% (South).
- Benchmark results show that mBART-large-50 achieves the best performance on ViDia2Std (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), while ViT5-base offers competitive results with fewer parameters.
- The work demonstrates that dialect normalization substantially improves downstream NLP tasks and underscores the need for dialect-aware resources to build robust Vietnamese NLP systems.
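
To give a feel for what a ROUGE-L score such as the one quoted above measures, here is a minimal sketch of a sentence-level ROUGE-L F1 based on the longest common subsequence (LCS). It assumes whitespace tokenization and an equal weighting of precision and recall; the paper's actual evaluation setup (tokenizer, weighting, corpus-level aggregation) is not stated in this summary and may differ. The function names and the example sentence pair are illustrative and are not taken from ViDia2Std.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Sentence-level ROUGE-L F1 with whitespace tokens (simplifying assumption)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative pair (not from the corpus): a model output that leaves one
# dialect word unnormalized, scored against a standard-Vietnamese reference.
hypothesis = "mày đi đâu rứa"
reference = "mày đi đâu vậy"
score = rouge_l_f1(hypothesis, reference)
```

In this example three of the four tokens appear in the same order in both sentences, so precision and recall are both 3/4. A score near 0.94, as reported for mBART-large-50, would therefore indicate outputs that share almost all of their in-order tokens with the references.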