Caption Generation for Dongba Paintings via Prompt Learning and Semantic Fusion

arXiv cs.CV / 3/25/2026


Key Points

  • The paper tackles caption generation for Dongba paintings with a domain-bridging approach, addressing the severe mismatch between generic image-captioning models and culturally specific Dongba imagery.
  • It proposes PVGF-DPC, an encoder–decoder system that uses a MobileNetV2 visual encoder and a 10-layer Transformer decoder initialized with pretrained BERT weights, with mechanisms to steer generation toward culture-aware labels.
  • A content prompt module maps extracted image features to thematic Dongba-related concepts (e.g., deity, ritual pattern, hell ghost) and forms a prompt that guides the decoder.
  • The method adds a visual semantic-generation fusion loss that jointly optimizes objectives for both the prompt predictor and the caption generator to improve semantic alignment with the input.
  • The authors also release a dedicated Dongba captioning dataset of 9,408 augmented images with annotations across seven thematic categories.
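The content prompt module in the key points above can be sketched as a small classifier over pooled image features that selects a thematic label and converts it into a textual prompt for the decoder. This is an illustrative sketch only: the class name, the 1280-d MobileNetV2 feature size, the prompt template, and the four placeholder theme names (only *deity*, *ritual pattern*, and *hell ghost* are named in the summary) are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Three themes are named in the summary; the remaining four of the
# seven categories are unnamed placeholders, not the paper's labels.
THEMES = ["deity", "ritual pattern", "hell ghost",
          "theme-4", "theme-5", "theme-6", "theme-7"]

class ContentPromptModule(nn.Module):
    """Hypothetical sketch: map an image feature vector to a
    culture-aware theme label and build a decoder prompt from it."""

    def __init__(self, feat_dim: int = 1280, num_themes: int = len(THEMES)):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_themes)

    def forward(self, image_feat: torch.Tensor):
        logits = self.classifier(image_feat)       # (B, 7) theme scores
        label_idx = logits.argmax(dim=-1)          # predicted theme per image
        # Assumed prompt template; the paper's "post-prompt" wording differs.
        prompts = [f"a Dongba painting depicting {THEMES[i]}"
                   for i in label_idx.tolist()]
        return logits, prompts

module = ContentPromptModule()
logits, prompts = module(torch.randn(2, 1280))
```

The prompt predictor's logits are kept alongside the prompt strings so a classification loss can be attached to them during joint training.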

Abstract

Dongba paintings, the treasured pictorial legacy of the Naxi people in southwestern China, feature richly layered visual elements, vivid color palettes, and pronounced ethnic and regional cultural symbolism, yet their automatic textual description remains largely unexplored owing to severe domain shift when mainstream captioning models are applied directly. This paper proposes PVGF-DPC (Prompt and Visual Semantic-Generation Fusion-based Dongba Painting Captioning), an encoder-decoder framework that integrates a content prompt module with a novel visual semantic-generation fusion loss to bridge the gap between generic natural-image captioning and the culturally specific imagery found in Dongba art. A MobileNetV2 encoder extracts discriminative visual features, which are injected into the layer normalization of a 10-layer Transformer decoder initialized with pretrained BERT weights; meanwhile, the content prompt module maps the image feature vector to culture-aware labels -- such as deity, ritual pattern, or hell ghost -- and constructs a post-prompt that steers the decoder toward thematically accurate descriptions. The visual semantic-generation fusion loss jointly optimizes the cross-entropy objectives of both the prompt predictor and the caption generator, encouraging the model to extract key cultural and visual cues and to produce captions that are semantically aligned with the input image. We construct a dedicated Dongba painting captioning dataset comprising 9,408 augmented images with culturally grounded annotations spanning seven thematic categories.
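The fusion loss described in the abstract can be illustrated as a weighted sum of two cross-entropy terms: one for the prompt predictor's theme classification and one for the caption generator's token prediction. The weighting `alpha`, the padding-token handling, and all tensor shapes below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def fusion_loss(theme_logits, theme_targets,
                caption_logits, caption_targets,
                alpha: float = 0.5, pad_id: int = 0):
    """Sketch of a joint visual semantic-generation fusion objective:
    combine the prompt predictor's and caption generator's losses."""
    # (B, C) theme scores vs. (B,) gold theme labels
    prompt_loss = F.cross_entropy(theme_logits, theme_targets)
    # (B, T, V) token scores vs. (B, T) gold caption tokens,
    # ignoring padding positions in the reference captions
    caption_loss = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1),
        ignore_index=pad_id,
    )
    return alpha * prompt_loss + (1 - alpha) * caption_loss

loss = fusion_loss(torch.randn(2, 7), torch.tensor([1, 3]),
                   torch.randn(2, 5, 100), torch.randint(1, 100, (2, 5)))
```

Optimizing both terms through the shared encoder is what pushes the visual features to carry the cultural cues the prompt predictor needs while still supporting fluent caption generation.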