Caption Generation for Dongba Paintings via Prompt Learning and Semantic Fusion
arXiv cs.CV / 3/25/2026
Key Points
- The paper addresses caption generation for Dongba paintings, introducing a domain-bridging approach to the severe mismatch between generic image-captioning models and culturally specific Dongba imagery.
- It proposes PVGF-DPC, an encoder–decoder system that uses a MobileNetV2 visual encoder and a 10-layer Transformer decoder initialized with pretrained BERT weights, with mechanisms to steer generation toward culture-aware labels.
- A content prompt module maps extracted image features to thematic Dongba-related concepts (e.g., deity, ritual pattern, hell ghost) and forms a prompt that guides the decoder.
- The method adds a visual semantic-generation fusion loss that jointly optimizes objectives for both the prompt predictor and the caption generator to improve semantic alignment with the input.
- The authors also release a dedicated Dongba captioning dataset of 9,408 augmented images with annotations across seven thematic categories.
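The fusion loss described above can be read as a weighted sum of two cross-entropy objectives: one for the caption generator over its output tokens and one for the prompt predictor over the thematic categories. The sketch below is an illustrative NumPy formulation under that reading; the weighting factor `lam`, the per-token averaging, and the function names are assumptions, not the paper's exact formula.

```python
import numpy as np

def cross_entropy(logits, target):
    # Numerically stable softmax cross-entropy for one prediction.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def fused_loss(caption_logits, caption_targets, prompt_logits, prompt_target, lam=0.5):
    """Joint objective: caption generation loss plus a weighted
    prompt-classification loss (lam is a hypothetical trade-off weight)."""
    # Caption loss averaged over the generated token positions.
    cap = np.mean([cross_entropy(l, t)
                   for l, t in zip(caption_logits, caption_targets)])
    # Prompt loss over the thematic categories (seven in the paper's dataset).
    prm = cross_entropy(prompt_logits, prompt_target)
    return cap + lam * prm
```

Optimizing this combined scalar pushes the decoder toward fluent captions while keeping the predicted content prompt consistent with the image's thematic category, which is the semantic-alignment effect the key points attribute to the fusion loss.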