MAE-Based Self-Supervised Pretraining for Data-Efficient Medical Image Segmentation Using nnFormer

arXiv cs.CV · April 28, 2026

📰 News · Models & Research

Key Points

  • The paper proposes a data-efficient self-supervised pretraining method for nnFormer-based volumetric medical image segmentation using Masked Autoencoders (MAE).
  • It addresses the practical issue that transformer segmentation models often require large labeled datasets, risk overfitting, and can be unstable to train, which is costly in medical domains.
  • The method pretrains the model on abundant unlabeled 3D medical images by reconstructing randomly masked input regions to learn anatomical and structural representations.
  • The pretrained encoder is then fine-tuned on labeled data for the downstream segmentation task, improving performance (higher Dice scores), convergence speed, and generalization when labeled data is limited.
  • Overall, the results support self-supervised learning as a suitable approach to mitigate labeled-data scarcity in medical image analysis when paired with transformer-based segmentation architectures like nnFormer.
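To make the pretraining objective in the bullets above concrete, here is a minimal NumPy sketch of the MAE recipe applied to a 3D volume: split the volume into non-overlapping cubic patches, randomly mask a fixed fraction of them, and compute the reconstruction loss only on the masked patches. This is an illustrative sketch, not the paper's implementation; the function names, the 75% mask ratio (a common MAE default), and the toy volume sizes are assumptions.

```python
import numpy as np

def patchify(volume, p):
    # Split a DxHxW volume into non-overlapping p*p*p patches,
    # each flattened into a vector of length p**3.
    D, H, W = volume.shape
    return (volume
            .reshape(D // p, p, H // p, p, W // p, p)
            .transpose(0, 2, 4, 1, 3, 5)
            .reshape(-1, p ** 3))

def random_mask(n_patches, mask_ratio, rng):
    # Boolean mask marking which patches are hidden from the encoder.
    n_masked = int(n_patches * mask_ratio)
    perm = rng.permutation(n_patches)
    masked = np.zeros(n_patches, dtype=bool)
    masked[perm[:n_masked]] = True
    return masked

def mae_loss(pred_patches, target_patches, masked):
    # MAE objective: mean squared error over masked patches only;
    # visible patches do not contribute to the loss.
    diff = (pred_patches - target_patches) ** 2
    return diff[masked].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vol = rng.standard_normal((32, 32, 32))     # toy unlabeled volume
    patches = patchify(vol, 8)                  # shape (64, 512)
    masked = random_mask(len(patches), 0.75, rng)
    print(masked.sum(), "of", len(patches), "patches masked")
```

In the real pipeline, `pred_patches` would come from the nnFormer encoder plus a lightweight decoder; here the decoupling of masking and loss illustrates why the encoder must infer anatomy from the visible context alone.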

Abstract

Transformer architectures, including nnFormer, have demonstrated promising results in volumetric medical image segmentation by capturing long-range spatial interactions. Despite their strong performance, these models require large quantities of labeled training data, are prone to overfitting, and can be unstable during training. This is a serious practical problem because obtaining expert-annotated medical images is both time-consuming and expensive. Moreover, traditional fully supervised training pipelines fail to exploit the large amounts of unlabeled medical imaging data readily available in clinical settings. We address these drawbacks by improving the data efficiency of nnFormer with a self-supervised pretraining framework based on Masked Autoencoders (MAE). In this method, the model is pretrained on unlabeled volumetric medical images to reconstruct randomly masked parts of the input, which allows the encoder to learn meaningful anatomical and structural representations. The encoder is then fine-tuned on a labeled dataset for the downstream segmentation task. Experiments show that the proposed method achieves higher segmentation performance as measured by Dice score, faster convergence during fine-tuning, and better generalization with limited labeled data. These findings validate self-supervised learning combined with transformer-based segmentation models as an effective approach to the problem of labeled-data scarcity in medical image analysis.