Generalized Category Discovery under Domain Shifts: From Vision to Vision-Language Models

arXiv cs.CV / 5/5/2026


Key Points

  • Generalized Category Discovery (GCD) is studied under domain shifts, addressing the gap in prior work, which typically assumes unlabelled data comes from a single domain.
  • The paper proposes three frameworks—HiLo, HLPrompt, and VLPrompt—that adapt foundation models, ranging from self-supervised vision backbones to vision-language models, to handle both domain and semantic variation.
  • HiLo disentangles domain versus semantic features using multi-level feature extraction, mutual information minimization, and training strategies like PatchMix augmentation and curriculum sampling.
  • HLPrompt builds on HiLo with semantic-aware spatial prompt tuning to reduce the impact of background and domain noise during category discovery.
  • VLPrompt uses vision-language models with factorized textual prompts and cross-modal consistency regularization, achieving consistent gains on both synthetic corruptions and multi-domain real-world shift settings.
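To make the PatchMix idea in the points above concrete, here is a minimal sketch of patch-level mixing between an image and a counterpart from another domain. The function name, parameters, and label-weighting convention are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def patchmix(img_a, img_b, patch=4, ratio=0.5, rng=None):
    """Hypothetical PatchMix sketch: replace a random subset of
    non-overlapping patches of img_a with the corresponding patches
    of img_b (e.g. an image from a shifted domain).

    Returns the mixed image and lam, the fraction of img_a retained,
    which a training loss could use to weight the two labels."""
    rng = np.random.default_rng(rng)
    h, w, _ = img_a.shape
    mixed = img_a.copy()
    ph, pw = h // patch, w // patch
    # Bernoulli mask over the patch grid: True means "take from img_b"
    mask = rng.random((ph, pw)) < ratio
    for i in range(ph):
        for j in range(pw):
            if mask[i, j]:
                ys, xs = i * patch, j * patch
                mixed[ys:ys + patch, xs:xs + patch] = \
                    img_b[ys:ys + patch, xs:xs + patch]
    lam = 1.0 - mask.mean()  # fraction of pixels still from img_a
    return mixed, lam
```

A mixed sample exposes the model to both domains within one image, which is one plausible way such an augmentation encourages domain-invariant semantic features.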

Abstract

Generalized Category Discovery (GCD) aims to categorize unlabelled instances from both known and unknown classes by transferring knowledge from labelled data of known classes. Existing methods assume all data comes from a single domain, yet real-world unlabelled data often exhibits domain shifts alongside semantic shifts. We study GCD under domain shifts and propose three frameworks that adapt foundation models, ranging from self-supervised vision models to vision-language models. (i) HiLo disentangles domain and semantic features through multi-level feature extraction and mutual information minimization, combined with PatchMix augmentation and curriculum sampling. (ii) HLPrompt extends HiLo with semantic-aware spatial prompt tuning to suppress background and domain noise. (iii) VLPrompt leverages vision-language models via factorized textual prompts and cross-modal consistency regularization. The three methods share core design principles while operating on different foundation backbones, making them suitable for different deployment scenarios. Extensive experiments on synthetic corruptions and real-world multi-domain shifts demonstrate consistent improvements over strong baselines. Project page: https://visual-ai.github.io/hilo/
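The cross-modal consistency regularization mentioned for VLPrompt can be sketched as a penalty that keeps the similarity structure induced by image embeddings aligned with the one induced by text-prompt embeddings. The function names and the L2 formulation below are assumptions for illustration, not the paper's actual loss:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def consistency_loss(img_emb, txt_emb):
    """Hypothetical cross-modal consistency sketch: penalize the
    mismatch between image-to-image and image-to-text cosine
    similarity matrices for a batch of paired embeddings."""
    zi = l2_normalize(img_emb)
    zt = l2_normalize(txt_emb)
    sim_ii = zi @ zi.T  # image-to-image similarities
    sim_it = zi @ zt.T  # image-to-text similarities
    return float(np.mean((sim_ii - sim_it) ** 2))
```

When the two modalities agree perfectly (identical embeddings), the penalty is zero; any divergence between the visual and textual views of the batch increases it.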