Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention

arXiv cs.CL / 5/1/2026


Key Points

  • The paper addresses Vietnamese scene-text image captioning by arguing that text cannot be treated as language-agnostic, because tones and diacritics change word meaning and OCR is error-prone.
  • It proposes HSTFG (Heterogeneous Scene-Text Fusion Graph), a graph-based multimodal fusion framework with learned spatial attention bias for integrating visual features, OCR text, and linguistic knowledge.
  • Topology analysis suggests that cross-modal graph edges can be harmful for scene-text fusion, leading to a specialized Vietnamese-focused design, PhonoSTFG (Phonological Scene-Text Fusion Graph).
  • The work introduces ViTextCaps, the first large-scale Vietnamese dataset for this task (15,729 images and 74,970 captions) and reports that 52.8% of the vocabulary is vulnerable to diacritic collision.
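The diacritic-collision risk flagged in the last point can be made concrete with a small sketch (illustrative only, not from the paper): stripping Vietnamese tone and vowel marks, which is what a diacritic-level OCR error effectively does, collapses several distinct words onto one base string.

```python
import unicodedata

def strip_diacritics(word: str) -> str:
    """Remove combining marks (tones and vowel diacritics) via NFD decomposition.
    Note: this does not fold the base letter đ/Đ, which is not a combining mark."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Six distinct Vietnamese words that collapse to the same base form "ma":
# ma (ghost), má (mother), mà (but), mả (grave), mã (code/horse), mạ (rice seedling)
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
collapsed = {strip_diacritics(w) for w in words}
print(collapsed)  # {'ma'} -- all six meanings collide once diacritics are lost
```

A caption model that treats OCR text as language-agnostic strings cannot recover which of these meanings was intended, which is the motivation for the linguistically informed fusion proposed here.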

Abstract

Scene-text image captioning requires fusing three information streams -- visual features, OCR-detected text, and linguistic knowledge -- to generate descriptions that faithfully integrate text visible in images. Existing fusion approaches treat text as language-agnostic, which fails for Vietnamese: a tonal language where diacritics alter word meaning, OCR errors are pervasive, and word boundaries are ambiguous. We argue that Vietnamese scene-text captioning demands linguistically informed multimodal fusion, where language-specific structural knowledge is explicitly incorporated into the fusion mechanism. Motivated by these insights, we propose HSTFG (Heterogeneous Scene-Text Fusion Graph), a general-purpose graph fusion framework with learned spatial attention bias, and show through topology analysis that cross-modal graph edges are harmful for scene-text fusion. Building on this finding, we design PhonoSTFG (Phonological Scene-Text Fusion Graph), which specializes graph-level fusion for Vietnamese linguistic reasoning. To support evaluation, we introduce ViTextCaps, the first large-scale Vietnamese scene-text captioning dataset (15,729 images with 74,970 captions), with comprehensive linguistic analysis showing that 52.8% of the vocabulary is at risk of diacritic collision.