Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
arXiv cs.CV / 4/27/2026
💬 Opinion / Tools & Practical Usage / Models & Research
Key Points
- The paper frames training-free video reasoning segmentation as a video QA task, extracting attention maps from an MLLM using an attention rollout mechanism.
- It argues that raw attention maps are noisy and misaligned with object regions, and introduces Decomposed Attention Fusion (DecAF) to refine them.
- DecAF improves localization via contrastive object–background fusion and complementary video-frame fusion to suppress irrelevant activations and strengthen object-focused cues.
- The approach converts refined attention maps into coarse segmentation masks and further uses attention-guided SAM2 prompting to produce fine-grained masks.
- Experiments on referring and reasoning VOS benchmarks show that DecAF outperforms other training-free methods and matches the performance of training-based methods, all without retraining the MLLM or SAM2.
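The attention-extraction step in the first point uses attention rollout. A minimal NumPy sketch of the standard rollout recipe (head-averaging, residual identity, layer-wise composition) is below; the paper's exact variant may differ:

```python
import numpy as np

def attention_rollout(attn_layers):
    """Standard attention rollout: average attention over heads, add the
    residual identity, re-normalize rows, and compose across layers.

    attn_layers: list of arrays, each of shape (heads, tokens, tokens).
    Returns a (tokens, tokens) rollout matrix whose rows sum to 1.
    """
    num_tokens = attn_layers[0].shape[-1]
    rollout = np.eye(num_tokens)
    for layer in attn_layers:
        a = layer.mean(axis=0)                 # average over attention heads
        a = a + np.eye(num_tokens)             # account for residual connection
        a = a / a.sum(axis=-1, keepdims=True)  # keep rows stochastic
        rollout = a @ rollout                  # compose with earlier layers
    return rollout
```

Rows of the result stay stochastic because each re-normalized layer matrix is row-stochastic and products of row-stochastic matrices remain so.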
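The two fusion steps and the coarse-mask conversion can be sketched as follows. This is my illustrative interpretation under stated assumptions (per-frame object-query and background-query maps plus one video-level map); the paper's actual formulation may weight or combine these differently:

```python
import numpy as np

def decaf_fuse(obj_maps, bg_maps, video_map, thresh=0.5):
    """Hedged sketch of DecAF-style fusion (details are assumptions).

    obj_maps, bg_maps: (T, H, W) per-frame attention for the object query
    and a background query; video_map: (H, W) video-level attention.
    Returns coarse binary masks and the fused, normalized maps.
    """
    # Contrastive object-background fusion: keep only activations where the
    # object query attends more strongly than the background query.
    contrast = np.clip(obj_maps - bg_maps, 0.0, None)
    # Complementary video-frame fusion: modulate per-frame maps with the
    # video-level map so both cues must agree.
    fused = contrast * video_map[None, :, :]
    # Per-frame max-normalization, then threshold into coarse masks.
    peak = fused.reshape(fused.shape[0], -1).max(axis=1)
    fused = fused / np.maximum(peak, 1e-8)[:, None, None]
    masks = fused > thresh
    return masks, fused
```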
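For the final refinement step, the coarse mask has to be turned into prompts SAM2 can consume (positive points and/or a bounding box). The conversion below is an illustrative strategy of that kind, not the paper's exact prompting scheme; note that a centroid point can fall outside a strongly concave region:

```python
import numpy as np

def mask_to_sam_prompts(mask):
    """Convert a coarse binary mask into a point prompt and a box prompt
    of the kind SAM2 accepts. Illustrative only.

    mask: (H, W) boolean array. Returns None if the mask is empty.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    box = (xs.min(), ys.min(), xs.max(), ys.max())  # x0, y0, x1, y1
    point = (int(xs.mean()), int(ys.mean()))        # centroid as (x, y)
    return {"point": point, "box": box}
```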