SwinTextUNet: Integrating CLIP-Based Text Guidance into Swin Transformer U-Nets for Medical Image Segmentation

arXiv cs.CV / 4/14/2026


Key Points

  • The paper introduces SwinTextUNet, a multimodal medical image segmentation framework that injects CLIP-derived text embeddings into a Swin Transformer U-Net backbone for more robust performance under ambiguous or low-contrast visual conditions.
  • It integrates the textual guidance with hierarchical visual features using cross-attention and convolutional fusion, aligning semantic text cues with multi-scale representations.
  • Experiments on the QaTaCOV19 dataset show that a four-stage variant achieves Dice and IoU scores of 86.47% and 78.2%, respectively, balancing accuracy and complexity.
  • Ablation studies confirm that both text guidance and the multimodal fusion components are critical to the observed gains.
  • Overall, the work presents evidence that vision-language integration can improve segmentation quality in ways that may support clinically meaningful diagnostic tools.
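The paper's exact fusion module isn't reproduced here, but the mechanism the key points describe, visual tokens attending to CLIP-derived text embeddings via cross-attention with a residual merge, can be sketched minimally in NumPy. All shapes, names, and the single-head formulation below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_cross_attention(visual, text):
    """Single-head cross-attention sketch (hypothetical, not the paper's code).

    visual: (N, d) flattened patch features at one decoder scale (queries)
    text:   (M, d) CLIP-derived text token embeddings, assumed already
            projected into the same d-dimensional space (keys/values)
    Returns fused features of shape (N, d).
    """
    d = visual.shape[1]
    scores = visual @ text.T / np.sqrt(d)  # (N, M) scaled similarities
    attn = softmax(scores, axis=-1)        # each visual token attends over text tokens
    attended = attn @ text                 # (N, d) text-informed features
    return visual + attended               # residual fusion with the visual stream

# toy example: 16 visual tokens, 4 text tokens, embedding dim 8
rng = np.random.default_rng(0)
v = rng.standard_normal((16, 8))
t = rng.standard_normal((4, 8))
fused = text_guided_cross_attention(v, t)
print(fused.shape)  # (16, 8)
```

In the full model this would be applied per scale of the Swin hierarchy and followed by convolutional fusion; here only the attention step is shown.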

Abstract

Precise medical image segmentation is fundamental for enabling computer-aided diagnosis and effective treatment planning. Traditional models that rely solely on visual features often struggle when confronted with ambiguous or low-contrast patterns. To overcome these limitations, we introduce SwinTextUNet, a multimodal segmentation framework that incorporates Contrastive Language-Image Pretraining (CLIP)-derived textual embeddings into a Swin Transformer U-Net backbone. By integrating cross-attention and convolutional fusion, the model effectively aligns semantic text guidance with hierarchical visual representations, enhancing robustness and accuracy. We evaluate our approach on the QaTaCOV19 dataset, where the proposed four-stage variant achieves an optimal balance between performance and complexity, yielding Dice and IoU scores of 86.47% and 78.2%, respectively. Ablation studies further validate the importance of text guidance and multimodal fusion. These findings underscore the promise of vision-language integration in advancing medical image segmentation and supporting clinically meaningful diagnostic tools.
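For readers unfamiliar with the two evaluation metrics reported above, Dice and IoU are standard overlap measures between a predicted binary mask and the ground truth. A minimal NumPy definition (generic, not tied to this paper's evaluation code):

```python
import numpy as np

def dice_iou(pred, target, eps=1e-7):
    """Dice coefficient and IoU for binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = 2.0 * inter / (pred.sum() + target.sum() + eps)
    iou = inter / (union + eps)
    return dice, iou

# toy 2x3 masks: 2 overlapping pixels, 3 positives in each mask, union of 4
pred   = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
d, i = dice_iou(pred, target)
print(round(d, 3), round(i, 3))  # 0.667 0.5
```

Dice weights the intersection twice relative to the mask sizes, so it is always at least as large as IoU on the same prediction, consistent with the 86.47% vs. 78.2% gap reported.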