Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images

arXiv cs.CV / 4/10/2026


Key Points

  • The paper addresses adnexal mass segmentation in ultrasound cine images, highlighting that subjective interpretation and inter-observer variability make automated risk assessment difficult.
  • It proposes a label-efficient segmentation framework that adapts a pretrained DINOv3 vision transformer backbone with a DPT-style decoder to fuse global semantic priors and fine spatial details.
  • On a clinical dataset (7,777 frames from 112 patients), the method outperforms fully supervised convolutional baselines, reporting a Dice score of 0.945 and improved boundary accuracy.
  • Compared with the strongest convolutional baseline, it reduces the 95th-percentile Hausdorff Distance by 11.4%, indicating better contour adherence.
  • Efficiency experiments show strong robustness under limited annotations, maintaining high performance even when trained with only 25% of the data, suggesting a practical approach for data-constrained medical settings.
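The two metrics cited above, the Dice score and the 95th-percentile Hausdorff Distance (HD95), can be computed directly from binary masks. A minimal numpy sketch (not the paper's evaluation code; a brute-force pairwise-distance HD95 that is fine for single masks but would be replaced by a KD-tree or distance transform at scale):

```python
import numpy as np

def dice_score(pred, gt):
    """Dice = 2|A∩B| / (|A| + |B|) for two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2.0 * inter / total if total else 1.0

def hd95(pred, gt):
    """95th percentile of the symmetric point-to-set distances
    between the foreground pixels of two binary masks."""
    a = np.argwhere(pred.astype(bool))
    b = np.argwhere(gt.astype(bool))
    # pairwise Euclidean distances between the two coordinate sets
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    d_ab = d.min(axis=1)  # each pred pixel to its nearest gt pixel
    d_ba = d.min(axis=0)  # each gt pixel to its nearest pred pixel
    return float(np.percentile(np.concatenate([d_ab, d_ba]), 95))
```

Unlike Dice, which rewards area overlap, HD95 penalizes outlying boundary errors, which is why the paper reports it as a measure of contour adherence.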

Abstract

Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with the domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundation vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff Distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance under data-starvation regimes, maintaining strong results even when trained on only 25% of the data. These results suggest that leveraging large-scale self-supervised foundations provides a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: https://github.com/FrancescaFati/MESA