Pretext Matters: An Empirical Study of SSL Methods in Medical Imaging

arXiv cs.CV / 3/25/2026

Key Points

  • The study evaluates how different self-supervised learning (SSL) objectives affect representation quality in medical imaging, focusing on joint embedding architectures (JEAs) and joint embedding predictive architectures (JEPAs) versus pixel reconstruction methods.
  • Using two modalities with distinct noise characteristics—ultrasound and histopathology—the authors find that the best SSL method depends on how clinically relevant signal is organized spatially.
  • When the informative signal is spatially localized, as in histopathology, JEAs perform best, which the authors attribute to their view-invariance objective; when diagnostically relevant information is globally structured, as in the macroscopic anatomy of liver ultrasound, JEPAs perform best.
  • The conclusions are strengthened by independent validation from board-certified radiologists and pathologists, linking SSL objective choice to clinical relevance of learned features.
  • The paper proposes a practical framework for selecting SSL objectives that match the structural and noise properties of each medical imaging modality.

Abstract

Though self-supervised learning (SSL) has demonstrated incredible ability to learn robust representations from unlabeled data, the choice of optimal SSL strategy can lead to vastly different performance outcomes in specialized domains. Joint embedding architectures (JEAs) and joint embedding predictive architectures (JEPAs) have shown robustness to noise and strong semantic feature learning compared to pixel reconstruction-based SSL methods, leading to widespread adoption in medical imaging. However, no prior work has systematically investigated which SSL objective is better aligned with the spatial organization of clinically relevant signal. In this work, we empirically investigate how the choice of SSL method impacts the learned representations in medical imaging. We select two representative imaging modalities characterized by unique noise profiles: ultrasound and histopathology. When informative signal is spatially localized, as in histopathology, JEAs are more effective due to their view-invariance objective. In contrast, when diagnostically relevant information is globally structured, such as the macroscopic anatomy present in liver ultrasounds, JEPAs are optimal. These differences are especially evident in the clinical relevance of the learned features, as independently validated by board-certified radiologists and pathologists. Together, our results provide a framework for matching SSL objectives to the structural and noise properties of medical imaging modalities.
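
To make the contrast between the three objective families concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the toy linear "encoders", the function names (`jea_loss`, `jepa_loss`, `reconstruction_loss`), the 8-pixel "image", and the masking scheme are all hypothetical, chosen only to show where each loss is computed. A JEA compares embeddings of two augmented views, a JEPA predicts the embedding of a hidden region from the visible context, and a reconstruction method predicts the hidden pixels themselves.

```python
import math
import random


def linear(weights, x):
    """Apply a toy linear map; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def cosine(a, b):
    """Cosine similarity, with a small epsilon to avoid division by zero."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b + 1e-12)


def jea_loss(encoder, view_a, view_b):
    """JEA-style view invariance: two views of one image should embed alike."""
    return 1.0 - cosine(linear(encoder, view_a), linear(encoder, view_b))


def jepa_loss(encoder, target_encoder, predictor, context, target):
    """JEPA-style prediction: loss is measured in embedding space."""
    predicted = linear(predictor, linear(encoder, context))
    return mse(predicted, linear(target_encoder, target))


def reconstruction_loss(encoder, decoder, masked, original, hidden_idx):
    """Pixel reconstruction: loss is measured on the hidden pixels."""
    recon = linear(decoder, linear(encoder, masked))
    return mse([recon[i] for i in hidden_idx], [original[i] for i in hidden_idx])


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
PIXELS, DIM = 8, 4  # hypothetical sizes for the toy example

encoder = rand_matrix(DIM, PIXELS)
# Assumption: the target encoder is a plain copy here; JEPA-style methods
# typically keep it as a slowly updated copy of the context encoder.
target_encoder = [row[:] for row in encoder]
predictor = rand_matrix(DIM, DIM)
decoder = rand_matrix(PIXELS, DIM)

image = [random.random() for _ in range(PIXELS)]
# Two "augmented views": the same image with small independent noise.
view_a = [p + random.gauss(0.0, 0.05) for p in image]
view_b = [p + random.gauss(0.0, 0.05) for p in image]

# Hide the second half of the image; the first half is the visible context.
hidden_idx = [4, 5, 6, 7]
context = [0.0 if i in hidden_idx else p for i, p in enumerate(image)]
target = [p if i in hidden_idx else 0.0 for i, p in enumerate(image)]

loss_jea = jea_loss(encoder, view_a, view_b)
loss_jepa = jepa_loss(encoder, target_encoder, predictor, context, target)
loss_recon = reconstruction_loss(encoder, decoder, context, image, hidden_idx)

print("JEA view-invariance loss:", loss_jea)
print("JEPA latent-prediction loss:", loss_jepa)
print("Pixel reconstruction loss:", loss_recon)
```

The design point is where the error is measured. The JEA loss rewards embeddings that ignore differences between views, the JEPA loss rewards embeddings from which the rest of the image's representation can be inferred, and the reconstruction loss is paid pixel by pixel, so it is also paid on noise in the hidden region.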