Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention
arXiv cs.CL / 5/1/2026
💬 Opinion · Developer Stack & Infrastructure · Models & Research
Key Points
- The paper addresses Vietnamese scene-text image captioning, arguing that scene text cannot be treated as language-agnostic: tones and diacritics change word meaning, and OCR is error-prone on them.
- It proposes HSTFG (Heterogeneous Scene-Text Fusion Graph), a graph-based multimodal fusion framework that integrates visual features, OCR text, and linguistic knowledge via a learned spatial attention bias (a hedged sketch of that bias follows these key points).
- Topology analysis suggests that cross-modal graph edges can be harmful for scene-text fusion, leading to a specialized Vietnamese-focused design, PhonoSTFG (Phonological Scene-Text Fusion Graph).
- The work introduces ViTextCaps, the first large-scale Vietnamese dataset for this task (15,729 images, 74,970 captions), and reports that 52.8% of the vocabulary is vulnerable to diacritic collision, i.e. distinct words that become identical once diacritics are stripped (a toy computation of such a rate is sketched below).
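
The summary does not spell out how HSTFG parameterizes its learned spatial attention bias, so the following is a minimal PyTorch sketch of the general idea: graph attention over visual-region and OCR-token nodes whose logits receive an additive bias predicted from pairwise bounding-box geometry. The class name, the 4-dimensional geometry encoding (center offsets plus log size ratios), and the single-head design are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn


class SpatialBiasedGraphAttention(nn.Module):
    """Illustrative single-head graph attention with a learned spatial bias.

    Hypothetical sketch only: the node layout, geometry encoding, and
    masking scheme are assumptions, not HSTFG's actual definition.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Small MLP mapping a 4-dim relative-geometry feature to a
        # scalar additive bias on each attention logit.
        self.spatial_bias = nn.Sequential(
            nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, 1)
        )
        self.scale = dim ** -0.5

    def forward(self, nodes, boxes, adj):
        # nodes: (N, dim) features for visual regions and OCR tokens
        # boxes: (N, 4) bounding boxes as (cx, cy, w, h)
        # adj:   (N, N) 0/1 edge mask, assumed to include self-loops
        #        so every row has at least one place to attend
        q, k, v = self.q(nodes), self.k(nodes), self.v(nodes)
        logits = (q @ k.t()) * self.scale  # (N, N)
        # Pairwise relative geometry: center offsets and log size ratios.
        rel = torch.cat(
            [
                boxes[:, None, :2] - boxes[None, :, :2],
                (boxes[:, None, 2:] / boxes[None, :, 2:].clamp(min=1e-6)).log(),
            ],
            dim=-1,
        )  # (N, N, 4)
        logits = logits + self.spatial_bias(rel).squeeze(-1)
        # Keep attention on the graph: non-edges get -inf before softmax.
        logits = logits.masked_fill(adj == 0, float("-inf"))
        return torch.softmax(logits, dim=-1) @ v
```

Restricting the softmax to graph edges is what makes this graph attention rather than dense cross-attention; it is also where a topology analysis like the paper's could intervene, e.g. by pruning cross-modal edges from `adj`.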
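The 52.8% figure counts words that lose their identity when diacritics are dropped, a realistic OCR failure mode for Vietnamese. The paper's exact measurement procedure isn't given in this summary; below is a minimal standard-library sketch of one plausible way to compute such a collision rate over a vocabulary (the `strip_diacritics` and `collision_rate` helpers are hypothetical names).

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word: str) -> str:
    """Remove tone marks and diacritics, e.g. 'mắt' -> 'mat'.
    NFD normalization splits base letters from combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # 'đ'/'Đ' carry no combining mark, so map them explicitly.
    return stripped.replace("\u0111", "d").replace("\u0110", "D")


def collision_rate(vocab) -> float:
    """Fraction of the vocabulary whose stripped form is shared with at
    least one other distinct word, i.e. words that become ambiguous if
    OCR drops the diacritics."""
    groups = defaultdict(set)
    for word in vocab:
        groups[strip_diacritics(word)].add(word)
    colliding = sum(len(g) for g in groups.values() if len(g) > 1)
    return colliding / len(vocab)


# 'ma', 'má', 'mà', 'mã', 'mạ' are five distinct Vietnamese words that
# all strip to 'ma'; 'xe' has no collision partner here.
print(collision_rate(["ma", "má", "mà", "mã", "mạ", "xe"]))  # 0.833...
```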