CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

arXiv cs.CV · April 17, 2026


Key Points

  • The paper proposes cross-modality token modulation to more tightly couple appearance and motion cues in two-stream architectures for unsupervised video object segmentation.
  • It builds dense connections between tokens from each modality and uses relation transformer blocks to propagate information both within and across modalities.
  • A token masking strategy is added to improve learning efficiency without simply increasing model complexity.
  • The method reportedly achieves state-of-the-art results across all public benchmarks, outperforming prior approaches.

Abstract

Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
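To make the mechanism described above concrete, here is a minimal NumPy sketch of the general idea: appearance and motion tokens are densely connected, each relation block applies intra-modal attention followed by inter-modal cross-attention, and a random token-masking step drops a fraction of tokens before processing. All function names, shapes, and the single-head attention form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values, d):
    """Single-head scaled dot-product attention (simplified: keys = values)."""
    scores = queries @ keys_values.T / np.sqrt(d)   # (Nq, Nk) dense token-to-token links
    return softmax(scores, axis=-1) @ keys_values

def relation_block(app, mot, d):
    """Illustrative relation block: intra-modal then inter-modal propagation,
    with residual connections. Not the paper's exact architecture."""
    app = app + attention(app, app, d)    # intra-modal (appearance)
    mot = mot + attention(mot, mot, d)    # intra-modal (motion)
    app_out = app + attention(app, mot, d)  # inter-modal: appearance attends to motion
    mot_out = mot + attention(mot, app, d)  # inter-modal: motion attends to appearance
    return app_out, mot_out

def mask_tokens(tokens, ratio, rng):
    """Randomly drop a fraction of tokens (hypothetical masking scheme)."""
    keep = rng.random(tokens.shape[0]) >= ratio
    return tokens[keep]

rng = np.random.default_rng(0)
d = 16
app = rng.standard_normal((8, d))   # 8 appearance tokens (e.g. from an RGB frame)
mot = rng.standard_normal((8, d))   # 8 motion tokens (e.g. from optical flow)

app_masked = mask_tokens(app, ratio=0.25, rng=rng)  # masking applied during training
out_app, out_mot = relation_block(app_masked, mot, d)
```

In practice such blocks would be stacked, use learned projections and multiple heads, and the masking ratio would be a tuned hyperparameter; the sketch only shows how dense cross-modal token connections and masking compose.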