Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning

arXiv cs.CV / 3/24/2026


Key Points

  • The paper introduces a new multimodal task, Controllable Video Segmentation and Captioning (SegCaptioning), where users can prompt with localized cues (e.g., a bounding box) to generate both object masks and matching captions that reflect intent.
  • It proposes SG-FSCFormer, a Scene Graph-guided Fine-grained SegCaptioning Transformer that uses a Prompt-guided Temporal Graph Former plus an adaptive prompt adaptor to better represent and follow user instructions over time.
  • The method includes a Fine-grained Mask-linguistic Decoder that jointly predicts caption–mask pairs using a multi-entity contrastive loss.
  • It adds fine-grained alignment between each predicted mask and its corresponding caption tokens to improve interpretability and user understanding.
  • Experiments on two benchmark datasets show improved performance in capturing user intent and producing precise, prompt-specific multimodal outputs, with code released on GitHub.

Abstract

Recent advancements in multimodal large models have significantly bridged the representation gap between diverse modalities, catalyzing the evolution of video multimodal interpretation, which enhances users' understanding of video content by generating correlated modalities. However, most existing video multimodal interpretation methods primarily concentrate on global comprehension with limited user interaction. To address this, we propose a novel task, Controllable Video Segmentation and Captioning (SegCaptioning), which empowers users to provide specific prompts, such as a bounding box around an object of interest, to simultaneously generate correlated masks and captions that precisely embody user intent. We design an innovative framework, the Scene Graph-guided Fine-grained SegCaptioning Transformer (SG-FSCFormer), which integrates a Prompt-guided Temporal Graph Former to effectively capture and represent user intent through an adaptive prompt adaptor, ensuring that the generated content aligns well with the user's requirements. Furthermore, our model introduces a Fine-grained Mask-linguistic Decoder to collaboratively predict high-quality caption-mask pairs using a Multi-entity Contrastive loss, and to provide fine-grained alignment between each mask and its corresponding caption tokens, thereby enhancing users' comprehension of videos. Comprehensive experiments conducted on two benchmark datasets demonstrate that SG-FSCFormer achieves remarkable performance, effectively capturing user intent and generating precise multimodal outputs tailored to user specifications. Our code is available at https://github.com/XuZhang1211/SG-FSCFormer.
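The abstract does not spell out the form of the Multi-entity Contrastive loss, but losses of this kind are typically a symmetric InfoNCE objective over matched pairs of entity embeddings. The sketch below illustrates that general idea for (mask embedding, caption-token embedding) pairs; the function name, the assumption that row i of each matrix describes the same entity, and the temperature value are all illustrative, not taken from the paper.

```python
import numpy as np

def multi_entity_contrastive_loss(mask_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over matched (mask, caption-token) embedding pairs.

    mask_emb, text_emb: (N, D) arrays; row i of each is assumed to embed
    the same entity (i.e., the diagonal pairs are the positives).
    """
    # L2-normalize so the dot product is cosine similarity.
    m = mask_emb / np.linalg.norm(mask_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = m @ t.T / temperature  # (N, N) similarity matrix

    def ce_diag(lg):
        # Cross-entropy with the diagonal as the target class,
        # computed via a numerically stable log-softmax.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the mask-to-text and text-to-mask directions.
    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))
```

With perfectly matched embeddings the loss approaches zero, while mismatched pairings drive it up, which is what pushes each predicted mask toward its own caption tokens and away from the other entities'.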