
MedSAD-CLIP: Supervised CLIP with Token-Patch Cross-Attention for Medical Anomaly Detection and Segmentation

arXiv cs.CV / 3/19/2026

📰 News · Models & Research

Key Points

  • MedSAD-CLIP introduces a supervised adaptation of CLIP for medical anomaly detection and segmentation using Token-Patch Cross-Attention to improve lesion localization while preserving CLIP's generalization.
  • The approach uses lightweight image adapters and learnable prompt tokens to efficiently tailor the pretrained CLIP encoder to the medical domain with a limited amount of labeled abnormal data.
  • A Margin-based image-text Contrastive Loss is proposed to enhance discrimination between normal and abnormal representations at the global feature level.
  • Experiments on four datasets (Brain, Retina, Lung, Breast) show superior pixel-level segmentation and image-level classification compared with state-of-the-art methods, with code to be released.
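The token-patch cross-attention mechanism from the first bullet can be illustrated with a minimal single-head sketch: text (prompt) tokens act as queries over image patch features, producing text-conditioned patch aggregations that can be scored per location. The function name, NumPy form, and toy shapes below are illustrative assumptions, not the paper's actual TPCA module.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_patch_cross_attention(text_tokens, patch_feats):
    """Text tokens (queries) attend over image patch features (keys/values).
    Hypothetical single-head sketch of the TPCA idea, not the paper's code.
    text_tokens: (T, d), patch_feats: (P, d)."""
    d = text_tokens.shape[-1]
    scores = text_tokens @ patch_feats.T / np.sqrt(d)  # (T, P) token-patch similarity
    attn = softmax(scores, axis=-1)                    # each token's weights over patches
    attended = attn @ patch_feats                      # (T, d) text-conditioned features
    return attended, attn

# toy example: 2 prompt tokens (e.g. "normal"/"abnormal"), 4 patches, dim 8
rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 8))
patches = rng.standard_normal((4, 8))
out, attn = token_patch_cross_attention(tokens, patches)
print(out.shape, attn.shape)  # (2, 8) (2, 4)
```

The per-token attention map `attn` is the piece relevant to localization: reshaping a token's weights over patches back onto the image grid gives a coarse spatial score for that token's concept.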

Abstract

Medical anomaly detection (MAD) and segmentation play a critical role in assisting clinical diagnosis by identifying abnormal regions in medical images and localizing pathological regions. Recent CLIP-based studies show promise for anomaly detection in zero-/few-shot settings, but they typically rely on global representations and weak supervision, often producing coarse localization and limited segmentation quality. In this work, we study supervised adaptation of CLIP for MAD under a realistic clinical setting where a limited yet meaningful amount of labeled abnormal data is available. Our model MedSAD-CLIP leverages fine-grained text-visual cues via Token-Patch Cross-Attention (TPCA) to improve lesion localization while preserving the generalization capability of CLIP representations. Lightweight image adapters and learnable prompt tokens efficiently adapt the pretrained CLIP encoder to the medical domain while preserving its rich semantic alignment. Furthermore, a Margin-based image-text Contrastive Loss is designed to enhance global feature discrimination between normal and abnormal representations. Extensive experiments on four diverse benchmarks (Brain, Retina, Lung, and Breast) demonstrate the effectiveness of our approach, achieving superior performance in both pixel-level segmentation and image-level classification over state-of-the-art methods. Our results highlight the potential of supervised CLIP adaptation as a unified and scalable paradigm for medical anomaly understanding. Code will be made available at https://github.com/thuy4tbn99/MedSAD-CLIP
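One plausible form of the margin-based image-text contrastive loss described above is a hinge on cosine similarities: each image embedding should be closer to its matching class text embedding ("normal" vs. "abnormal") than to the opposite one by at least a fixed margin. The function below is a hypothetical sketch of that idea; the paper's exact loss formulation may differ.

```python
import numpy as np

def margin_contrastive_loss(img_feats, txt_normal, txt_abnormal, labels, margin=0.2):
    """Hinge-style margin loss on cosine similarity (hypothetical sketch).
    For each image, the matching text embedding must out-score the opposite
    one by at least `margin`; otherwise the shortfall is penalized."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    losses = []
    for f, y in zip(img_feats, labels):
        pos = cos(f, txt_abnormal if y == 1 else txt_normal)  # matching text
        neg = cos(f, txt_normal if y == 1 else txt_abnormal)  # opposite text
        losses.append(max(0.0, margin - (pos - neg)))         # hinge with margin
    return float(np.mean(losses))

# toy check: perfectly aligned image-text pairs incur zero loss
t_norm = np.array([1.0, 0.0])
t_abn = np.array([0.0, 1.0])
imgs = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = margin_contrastive_loss(imgs, t_norm, t_abn, labels=[0, 1])
print(loss)  # 0.0
```

Because the hinge saturates at zero once the margin is met, well-separated pairs stop contributing gradient, concentrating learning on ambiguous normal/abnormal cases at the global feature level.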