MedSAD-CLIP: Supervised CLIP with Token-Patch Cross-Attention for Medical Anomaly Detection and Segmentation
arXiv cs.CV, March 19, 2026
Key Points
- MedSAD-CLIP introduces a supervised adaptation of CLIP for medical anomaly detection and segmentation using Token-Patch Cross-Attention to improve lesion localization while preserving CLIP's generalization.
- The approach uses lightweight image adapters and learnable prompt tokens to efficiently tailor the pretrained CLIP encoder to the medical domain with a limited amount of labeled abnormal data.
- A Margin-based image-text Contrastive Loss is proposed to enhance discrimination between normal and abnormal representations at the global feature level.
- Experiments on four datasets (Brain, Retina, Lung, Breast) show superior pixel-level segmentation and image-level classification performance compared with state-of-the-art methods; code is to be released.
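The margin-based image-text contrastive loss described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes one "normal" and one "abnormal" text embedding and a hinge that requires each image feature to be closer (in cosine similarity) to its matching text embedding than to the other by at least a margin. The function names and the exact hinge form are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def margin_contrastive_loss(img_feats, text_normal, text_abnormal, labels, margin=0.2):
    """Hypothetical hinge-style margin loss: each image's similarity to its
    matching text embedding should exceed its similarity to the non-matching
    one by at least `margin`. labels: 0 = normal image, 1 = abnormal image."""
    img = l2_normalize(img_feats)               # (N, D) global image features
    t_n = l2_normalize(text_normal)             # (D,) "normal" prompt embedding
    t_a = l2_normalize(text_abnormal)           # (D,) "abnormal" prompt embedding
    sim_n = img @ t_n                           # cosine similarity to normal text
    sim_a = img @ t_a                           # cosine similarity to abnormal text
    pos = np.where(labels == 0, sim_n, sim_a)   # similarity to the matching text
    neg = np.where(labels == 0, sim_a, sim_n)   # similarity to the other text
    return float(np.maximum(0.0, margin - (pos - neg)).mean())

# Toy usage: a normal and an abnormal image already well separated incur no loss.
imgs = np.array([[1.0, 0.0], [0.0, 1.0]])
t_norm, t_abn = np.array([1.0, 0.0]), np.array([0.0, 1.0])
labels = np.array([0, 1])
print(margin_contrastive_loss(imgs, t_norm, t_abn, labels))  # → 0.0
```

A misclassified image (e.g. a normal image whose feature aligns with the abnormal text) contributes a positive hinge penalty, pushing the global representations of normal and abnormal cases apart.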