Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

arXiv cs.CV / 4/27/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The paper frames training-free video reasoning segmentation as a video QA task, extracting attention maps from an MLLM via an attention rollout mechanism (see the rollout sketch after this list).
  • It argues that raw attention maps are noisy and misaligned with object regions, and introduces Decomposed Attention Fusion (DecAF) to refine them.
  • DecAF improves localization via contrastive object–background fusion and complementary video-frame fusion, which suppress irrelevant activations and strengthen object-focused cues (a fusion sketch follows the rollout code below).
  • The approach converts the refined attention maps into coarse segmentation masks and then uses attention-guided SAM2 prompting to produce fine-grained masks (see the prompting sketch after the abstract).
  • Experiments on referring and reasoning VOS benchmarks show DecAF outperforms other training-free methods and matches training-based performance without any MLLM/SAM retraining.
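
As a concrete reference for the rollout step, here is a minimal sketch of standard attention rollout in PyTorch. The helper names (`attention_rollout`, `token_to_map`), the contiguous visual-token layout, and the head-averaging choice are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def attention_rollout(attn_layers):
    """Standard attention rollout: fold per-layer attention into one
    token-to-token relevance matrix by adding the residual path and
    multiplying across layers.

    attn_layers: list of [num_heads, seq_len, seq_len] tensors, one per layer.
    Returns a [seq_len, seq_len] rollout matrix.
    """
    seq_len = attn_layers[0].shape[-1]
    rollout = torch.eye(seq_len)
    for attn in attn_layers:
        a = attn.mean(dim=0)                 # average over heads
        a = a + torch.eye(seq_len)           # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)  # renormalize rows
        rollout = a @ rollout                # accumulate layer by layer
    return rollout

def token_to_map(rollout, query_idx, vis_start, h, w):
    """Slice the rollout row of a chosen text token over the visual tokens
    and reshape it into an h-by-w spatial attention map. The contiguous
    visual-token layout is an assumption about the MLLM's input format."""
    row = rollout[query_idx, vis_start:vis_start + h * w]
    return row.reshape(h, w)

# Toy check with random attention (8 heads, 64 tokens, 12 layers).
layers = [torch.softmax(torch.randn(8, 64, 64), dim=-1) for _ in range(12)]
heat = token_to_map(attention_rollout(layers), query_idx=63, vis_start=0, h=8, w=8)
```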

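The two fusion mechanisms could plausibly look like the sketch below. The subtraction form, the blend weight, and the thresholding are assumptions made for illustration (the paper defines its own formulation), but they convey how contrasting against a background query and blending video-level with frame-level maps can clean up a noisy attention map.

```python
import torch
import torch.nn.functional as F

def contrastive_fusion(obj_map, bg_map, alpha=1.0):
    """Contrastive object-background fusion (one plausible form): subtract
    the map of a background/negative query to suppress activations that
    are not specific to the referred object. alpha is a hypothetical weight."""
    return (obj_map - alpha * bg_map).clamp(min=0)

def video_frame_fusion(video_map, frame_map, beta=0.5):
    """Complementary video-frame fusion (one plausible form): blend a
    video-level map, which carries temporal context, with the per-frame
    map, which is spatially sharper. beta is a hypothetical weight."""
    return beta * video_map + (1.0 - beta) * frame_map

def to_coarse_mask(attn_map, frame_hw, threshold=0.5):
    """Upsample the fused map to frame resolution, min-max normalize,
    and threshold into a binary coarse mask."""
    up = F.interpolate(attn_map[None, None], size=frame_hw,
                       mode="bilinear", align_corners=False)[0, 0]
    up = (up - up.min()) / (up.max() - up.min() + 1e-8)
    return (up > threshold).float()

# Toy example: fuse two 8x8 maps and lift to a 256x256 coarse mask.
obj, bg, frame = (torch.rand(8, 8) for _ in range(3))
fused = video_frame_fusion(contrastive_fusion(obj, bg), frame)
coarse = to_coarse_mask(fused, frame_hw=(256, 256))
```
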
Abstract

Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via an attention rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting to obtain fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks.
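
For the final refinement stage, a minimal sketch of attention-guided SAM2 prompting might look like the following. It assumes the `SAM2ImagePredictor` API from the facebookresearch/sam2 package and uses a single argmax point prompt per frame; the paper's actual prompt construction, and its handling of temporal propagation, may differ.

```python
import numpy as np
import torch
from sam2.sam2_image_predictor import SAM2ImagePredictor

def attention_to_point(attn_map, frame_hw):
    """Turn the fused attention map into a single positive point prompt:
    take the argmax and rescale its coordinates to frame resolution."""
    h, w = attn_map.shape
    y, x = divmod(int(attn_map.flatten().argmax()), w)
    fh, fw = frame_hw
    return np.array([[x * fw / w, y * fh / h]]), np.array([1])  # coords, label

# Hypothetical usage on one RGB frame (H x W x 3, uint8 numpy array).
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
frame = np.zeros((480, 854, 3), dtype=np.uint8)  # placeholder frame
coords, labels = attention_to_point(torch.rand(8, 8), frame.shape[:2])
with torch.inference_mode():
    predictor.set_image(frame)
    masks, scores, _ = predictor.predict(point_coords=coords,
                                         point_labels=labels,
                                         multimask_output=False)
```

For video, the sam2 package's video predictor (`build_sam2_video_predictor`, with point prompts propagated across frames) would be the natural fit; the single-image call above just keeps the sketch self-contained.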