Excite, Attend and Segment (EASe): Domain-Agnostic Fine-Grained Mask Discovery with Feature Calibration and Self-Supervised Upsampling

arXiv cs.CV / 4/2/2026


Key Points

  • The paper introduces EASe, an unsupervised, domain-agnostic semantic segmentation framework aimed at discovering fine-grained masks in scenes with complex, multi-component morphologies.
  • EASe improves upon coarse, patch-level mask discovery by operating on pixel-level feature representations, using Semantic-Aware Upsampling with Channel Excitation (SAUCE) to selectively calibrate low-resolution foundation-model features.
  • It further recovers full-resolution semantic structure via attention that integrates spatially encoded image features with foundation-model features.
  • For producing multi-granularity masks without extra training, EASe employs a training-free Cue-Attentive Feature Aggregator (CAFE) that uses SAUCE attention scores as semantic grouping signals.
  • Experiments report that EASe outperforms prior state-of-the-art unsupervised segmentation methods across multiple benchmarks and datasets, and the authors provide public code.
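The paper does not spell out SAUCE's excitation mechanism in this summary, but "exciting feature channels for selective calibration" follows the familiar squeeze-and-excitation pattern: pool each channel to a scalar, pass the pooled vector through a small gating network, and rescale channels by the resulting gates. A minimal NumPy sketch of that pattern (the weight shapes, reduction ratio `r`, and function name are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def channel_excitation(feats, w1, w2):
    """Squeeze-and-excitation-style channel recalibration (illustrative sketch).

    feats: (C, H, W) low-resolution feature map from a foundation model.
    w1:    (C//r, C) bottleneck weights; w2: (C, C//r) expansion weights.
    """
    # Squeeze: global average pool each channel to a single scalar -> (C,)
    z = feats.mean(axis=(1, 2))
    # Excite: bottleneck MLP with ReLU, then a sigmoid gate per channel
    h = np.maximum(w1 @ z, 0.0)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ h)))  # values in (0, 1)
    # Calibrate: scale each channel by its gate, suppressing uninformative ones
    return feats * gate[:, None, None]

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
feats = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
out = channel_excitation(feats, w1, w2)
print(out.shape)  # (8, 4, 4)
```

Because the gates lie in (0, 1), each channel is attenuated rather than amplified, which matches the "selective calibration" framing: the network can only down-weight channels it deems less semantically relevant.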

Abstract

Unsupervised segmentation approaches have increasingly leveraged foundation models (FM) to improve salient object discovery. However, these methods often falter in scenes with complex, multi-component morphologies, where fine-grained structural detail is indispensable. Many state-of-the-art unsupervised segmentation pipelines rely on mask discovery approaches that utilize coarse, patch-level representations. These coarse representations inherently suppress the fine-grained detail required to resolve such complex morphologies. To overcome this limitation, we propose Excite, Attend and Segment (EASe), an unsupervised domain-agnostic semantic segmentation framework for easy fine-grained mask discovery across challenging real-world scenes. EASe utilizes novel Semantic-Aware Upsampling with Channel Excitation (SAUCE) to excite low-resolution FM feature channels for selective calibration and attends across spatially-encoded image and FM features to recover full-resolution semantic representations. Finally, EASe segments the aggregated features into multi-granularity masks using a novel training-free Cue-Attentive Feature Aggregator (CAFE), which leverages SAUCE attention scores as a semantic grouping signal. EASe, together with SAUCE and CAFE, operates directly on pixel-level feature representations to enable accurate fine-grained dense semantic mask discovery. Our evaluation demonstrates superior performance of EASe over previous state-of-the-art methods (SOTAs) across major standard benchmarks and diverse datasets with complex morphologies. Code is available at https://ease-project.github.io
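The abstract describes CAFE as training-free, using SAUCE attention scores as a semantic grouping signal to produce masks. The simplest instance of that idea is winner-take-all assignment: each pixel joins the semantic cue with the highest attention score at that location. The sketch below illustrates only this generic grouping step, not the authors' actual aggregator (the function name and the toy attention map are assumptions):

```python
import numpy as np

def cues_to_masks(attn):
    """Group pixels into masks by their strongest attention cue (toy sketch).

    attn: (K, H, W) attention scores of K semantic cues over all pixels.
    Returns (K, H, W) boolean masks that partition the image.
    """
    labels = attn.argmax(axis=0)  # per-pixel winning cue index, shape (H, W)
    return np.stack([labels == k for k in range(attn.shape[0])])

# Toy example: cue 0 dominates the left two columns, cue 1 the right column.
attn = np.zeros((2, 2, 3))
attn[0, :, :2] = 1.0
attn[1, :, 2] = 1.0
masks = cues_to_masks(attn)
print(masks[0].sum(), masks[1].sum())  # 4 2
```

Because every pixel is claimed by exactly one cue, the resulting masks are disjoint and cover the image; a multi-granularity variant would merge or split cues at different thresholds rather than always taking a single argmax.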