Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning

arXiv cs.CV / 3/17/2026

Key Points

  • SF-CDFSL with vision-language models can suffer from a discriminability trap where strengthening visual discrimination harms cross-modal alignment and overall performance.
  • The paper analyzes how cross-entropy fine-tuning splits into visual learning and cross-modal learning, showing that the visual component can dominate and impede cross-modal alignment.
  • A two-step solution is proposed: perturb visual learning to bias the model toward cross-modal alignment, then gradually align visual and textual modalities using visual-text semantic relationships during fine-tuning.
  • Extensive experiments across multiple backbones (CLIP, SigLIP, PE-Core) and datasets (4 CDFSL and 11 FSL) demonstrate consistent state-of-the-art results, with code released for replication.

Abstract

Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where Vision-Language Models (VLMs) such as CLIP and SigLIP have shown promising results. Prior work on traditional visual models suggests that improving visual discriminability enhances performance. However, in VLM-based SF-CDFSL tasks, we find that strengthening visual-modal discriminability actually suppresses VLMs' performance. In this paper, we investigate this phenomenon and offer both an interpretation and a solution. Through theoretical analysis and experiments, our study reveals that fine-tuning with the typical cross-entropy loss ($\mathcal{L}_{\mathrm{vlm}}$) inherently includes a visual learning part and a cross-modal learning part, where the cross-modal part is crucial for rectifying the severe modality misalignment in SF-CDFSL. However, we find that the visual learning part essentially acts as a shortcut that encourages the model to reduce $\mathcal{L}_{\mathrm{vlm}}$ without considering the cross-modal part, thereby hindering cross-modal alignment and harming performance. Based on this interpretation, we propose an approach to address the problem: first, we perturb the visual learning to guide the model to focus on cross-modal alignment. Then, we use visual-text semantic relationships to gradually align the visual and textual modalities during fine-tuning. Extensive experiments on various settings, backbones (CLIP, SigLIP, PE-Core), and tasks (4 CDFSL datasets and 11 FSL datasets) show that our method consistently sets new state-of-the-art results. Code is available at https://github.com/zhenyuZ-HUST/CVPR26-Mind-the-Discriminability-Trap.
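
For readers unfamiliar with the loss the abstract refers to, the following is a minimal sketch of the usual CLIP-style classification cross-entropy: each image feature is scored against every class text feature by scaled cosine similarity, and the softmax cross-entropy is taken against the true class. This is an illustration only, not the paper's code. The function names, the toy features, and the `logit_scale` value of 100 are assumptions, and the sketch implements neither the visual/cross-modal decomposition nor the proposed perturbation and alignment steps.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def vlm_cross_entropy(image_feats, text_feats, labels, logit_scale=100.0):
    """Mean cross-entropy over scaled image-to-text cosine similarities.

    image_feats: list of N image feature vectors
    text_feats:  list of C class text feature vectors
    labels:      list of N class indices in [0, C)
    logit_scale: temperature multiplier (assumed value, not from the paper)
    """
    texts = [l2_normalize(t) for t in text_feats]
    total = 0.0
    for feat, label in zip(image_feats, labels):
        v = l2_normalize(feat)
        # One logit per class: scaled cosine similarity to that class's text.
        logits = [logit_scale * sum(a * b for a, b in zip(v, t)) for t in texts]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_z - logits[label]
    return total / len(labels)


# Toy 3-way example with hypothetical 3-dimensional features.
class_texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Image features that coincide with their class text features: loss near zero.
aligned = vlm_cross_entropy(class_texts, class_texts, [0, 1, 2])

# An image feature equally similar to every class text: loss equals log(3).
ambiguous = vlm_cross_entropy([[1.0, 1.0, 1.0]], class_texts, [0])
```

In this toy setting the loss depends only on how image features sit relative to the text features, which is the cross-modal relationship the abstract argues fine-tuning should prioritize.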