Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision
arXiv cs.CV / 3/17/2026
Key Points
- MASS introduces mask-guided self-supervised learning for 3D medical images, treating in-context segmentation as the pretext task to learn general-purpose representations without annotated data.
- It relies on automatically generated class-agnostic masks to provide structural supervision, enabling the model to learn semantic definitions of medical structures through a holistic combination of appearance, shape, spatial context, and anatomical relationships.
- Across data regimes, MASS scales from small single-dataset pretraining to large multi-modal pretraining on 5K CT, MRI, and PET volumes. It enables few-shot segmentation of novel structures, surpassing self-supervised baselines by more than 20 Dice points when labels are scarce, and its frozen encoder matches fully supervised training on unseen pathologies given thousands of labeled samples.
- Code is available on GitHub, making it possible to pursue 3D medical imaging foundation models without expert annotations.
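To make the pretext idea concrete, here is a minimal toy sketch of mask-guided self-supervision: automatically generated class-agnostic masks serve as supervision targets, and a Dice score measures how well a prediction matches them. The intensity-quantile mask generator and the `auto_masks`/`dice` helpers below are illustrative assumptions, not the paper's actual mask-generation or training pipeline.

```python
import numpy as np

def auto_masks(volume, n_bins=3):
    """Toy class-agnostic mask generator: quantize voxel intensities into
    n_bins bins. A simplified stand-in (assumption) for the automatically
    generated masks that MASS uses as structural supervision."""
    edges = np.quantile(volume, np.linspace(0, 1, n_bins + 1)[1:-1])
    labels = np.digitize(volume, edges)  # integer bin label per voxel
    return [labels == k for k in range(n_bins)]

def dice(pred, target, eps=1e-6):
    """Soft Dice score between a predicted mask and a pseudo-mask; the
    corresponding loss (1 - dice) would drive pretext training."""
    inter = np.sum(pred * target)
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

rng = np.random.default_rng(0)
vol = rng.normal(size=(8, 8, 8))          # stand-in for a 3D CT/MRI/PET volume
masks = auto_masks(vol)

# A perfect prediction of the first pseudo-mask scores Dice ~ 1.0
print(round(dice(masks[0].astype(float), masks[0].astype(float)), 3))
```

Because the masks are class-agnostic, the model never sees anatomical labels; it only learns that voxels grouped by a mask share appearance, shape, and spatial context, which is the signal the paper builds its representations from.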