MAESIL: Masked Autoencoder for Enhanced Self-supervised Medical Image Learning

arXiv cs.CV / 4/2/2026


Key Points

  • The paper introduces MAESIL, a new self-supervised learning framework for 3D medical imaging (especially CT) that targets the lack of labeled data.
  • It argues that common SSL approaches degrade 3D structural learning by treating CT volumes as independent 2D slices, discarding axial coherence and spatial context.
  • MAESIL’s key contribution is the “superpatch,” a 3D chunk-based input unit that aims to preserve 3D context while keeping computation manageable.
  • The method uses a 3D masked autoencoder with a dual-masking strategy to learn richer spatial representations from unlabeled scans.
  • Experiments on three large public CT datasets show MAESIL improves reconstruction quality (e.g., PSNR and SSIM) over baselines like AE, VAE, and VQ-VAE, positioning it as a practical pre-training option for downstream 3D tasks.

Abstract

Training deep learning models for three-dimensional (3D) medical imaging, such as Computed Tomography (CT), is fundamentally challenged by the scarcity of labeled data. While pre-training on natural images is common, it results in a significant domain shift, limiting performance. Self-Supervised Learning (SSL) on unlabeled medical data has emerged as a powerful solution, but prominent frameworks often fail to exploit the inherent 3D nature of CT scans. These methods typically process 3D scans as a collection of independent 2D slices, an approach that fundamentally discards critical axial coherence and 3D structural context. To address this limitation, we propose the Masked Autoencoder for Enhanced Self-supervised medical Image Learning (MAESIL), a novel self-supervised learning framework designed to capture 3D structural information efficiently. The core innovation is the 'superpatch', a 3D chunk-based input unit that balances 3D context preservation with computational efficiency. Our framework partitions the volume into superpatches and employs a 3D masked autoencoder with a dual-masking strategy to learn comprehensive spatial representations. We validated our approach on three diverse, large-scale public CT datasets. Our experimental results show that MAESIL demonstrates significant improvements over existing methods such as AE, VAE, and VQ-VAE in key reconstruction metrics such as PSNR and SSIM. This establishes MAESIL as a robust and practical pre-training solution for 3D medical imaging tasks.
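To make the superpatch idea concrete, the sketch below partitions a CT volume into non-overlapping 3D chunks and draws two random Boolean masks over them. This is a hedged illustration only: the paper does not publish code here, so the function names (`to_superpatches`, `dual_mask`), the superpatch size, and the masking ratios are all assumptions, not MAESIL's actual implementation.

```python
import numpy as np

def to_superpatches(volume, size):
    """Partition a 3D volume into non-overlapping cubic 'superpatches'.

    Illustrative sketch: the superpatch size is an assumption, not a
    value reported in the paper.
    """
    d, h, w = volume.shape
    s = size
    assert d % s == 0 and h % s == 0 and w % s == 0, "volume must tile evenly"
    # Reshape into a grid of blocks, then flatten the grid dimensions so
    # each row is one (s, s, s) superpatch.
    patches = (volume
               .reshape(d // s, s, h // s, s, w // s, s)
               .transpose(0, 2, 4, 1, 3, 5)
               .reshape(-1, s, s, s))
    return patches  # shape: (num_superpatches, s, s, s)

def dual_mask(num_patches, coarse_ratio=0.6, fine_ratio=0.15, rng=None):
    """Draw two independent random masks over the superpatch sequence
    (True = masked). The two-mask structure mirrors the 'dual-masking'
    idea at a high level; the ratios here are placeholder assumptions.
    """
    rng = rng or np.random.default_rng(0)
    coarse = rng.random(num_patches) < coarse_ratio
    fine = rng.random(num_patches) < fine_ratio
    return coarse, fine

# Toy example: a 4x4x4 volume split into eight 2x2x2 superpatches.
vol = np.arange(4 * 4 * 4, dtype=np.float32).reshape(4, 4, 4)
patches = to_superpatches(vol, size=2)
print(patches.shape)  # (8, 2, 2, 2)
coarse_mask, fine_mask = dual_mask(len(patches))
```

In a masked-autoencoder setup, the encoder would then see only the unmasked superpatches and the decoder would reconstruct the hidden ones; the reconstruction quality is what PSNR and SSIM measure in the paper's experiments.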