Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning
arXiv cs.CV / 3/24/2026
Key Points
- The paper introduces a new multimodal task, Controllable Video Segmentation and Captioning (SegCaptioning), in which a user supplies a localized prompt (e.g., a bounding box) and the model generates both object masks and matching captions that reflect the user's intent.
- It proposes SG-FSCFormer, a Scene Graph-guided Fine-grained SegCaptioning Transformer that uses a Prompt-guided Temporal Graph Former plus an adaptive prompt adaptor to better represent and follow user instructions over time.
- The method includes a Fine-grained Mask-linguistic Decoder that jointly predicts caption–mask pairs using a multi-entity contrastive loss.
- It adds fine-grained alignment between each predicted mask and its corresponding caption tokens to improve interpretability and user understanding.
- Experiments on two benchmark datasets show improved performance in capturing user intent and producing precise, prompt-specific multimodal outputs, with code released on GitHub.
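
The summary above does not give the form of the multi-entity contrastive loss, so the following is only a minimal sketch of one common way such a loss could look: a symmetric InfoNCE-style objective over paired mask and caption embeddings, where entity i's mask should match entity i's caption and all other entities act as negatives. The function name `multi_entity_contrastive_loss`, the `temperature` value, and the assumption of one embedding vector per entity are illustrative choices, not details taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def _diag_log_probs(logits):
    """For each row i, return log softmax(row)[i], computed stably."""
    out = []
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        out.append(row[i] - log_sum_exp)
    return out


def multi_entity_contrastive_loss(mask_embs, caption_embs, temperature=0.07):
    """Symmetric InfoNCE over N (mask, caption) entity pairs.

    Assumption: mask_embs[i] and caption_embs[i] describe the same entity.
    This is a hypothetical formulation, not the paper's exact loss.
    """
    if not mask_embs or len(mask_embs) != len(caption_embs):
        raise ValueError("need the same non-zero number of masks and captions")

    masks = [_normalize(v) for v in mask_embs]
    captions = [_normalize(v) for v in caption_embs]

    # Cosine similarity between every mask and every caption, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(m, c)) / temperature for c in captions]
        for m in masks
    ]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-mask direction

    n = len(masks)
    mask_to_caption = -sum(_diag_log_probs(logits)) / n
    caption_to_mask = -sum(_diag_log_probs(logits_t)) / n
    return 0.5 * (mask_to_caption + caption_to_mask)
```

Under these assumptions, the loss is close to zero when each mask embedding points in the same direction as its own caption embedding, and it grows when the pairs are mismatched, which is the behaviour a caption-mask pairing objective would need.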