NEMESIS: Noise-suppressed Efficient MAE with Enhanced Superpatch Integration Strategy

arXiv cs.CV / 4/3/2026


Key Points

  • NEMESIS is a masked autoencoder (MAE) framework for self-supervised learning on 3D CT volumes that uses local 128×128×128 “superpatches” to reduce memory demands while maintaining anatomical detail.
  • The method improves pretext learning with a noise-enhanced reconstruction task and uses Masked Anatomical Transformer Blocks (MATB) that apply dual masking via parallel plane-wise and axis-wise token removal.
  • It adds NEMESIS Tokens (NT) for cross-scale context aggregation to better capture anisotropic CT structure that conventional masking fails to represent well.
  • On the BTCV multi-organ benchmark, NEMESIS achieves 0.9633 mean AUROC with a frozen backbone plus linear classifier, outperforming fully fine-tuned SuPreM and VoCo.
  • In a low-label setting with only 10% of annotations, it still reaches 0.9075 AUROC; independently of label budget, the superpatch design cuts compute to 31.0 GFLOPs per forward pass, versus 985.8 GFLOPs for a full-volume baseline.
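The dual-masking idea in MATB (parallel plane-wise and axis-wise token removal) can be illustrated on a 3D token grid. The sketch below is a hypothetical reading of those two operations, not the paper's implementation: `plane_wise_mask` drops whole token planes perpendicular to one axis, while `axis_wise_mask` drops whole token lines running along an axis; function names and the `keep_ratio` parameter are my own.

```python
import numpy as np

def plane_wise_mask(grid_shape, axis, keep_ratio, rng):
    """Keep a random keep_ratio fraction of whole token planes
    perpendicular to `axis`; everything else is masked out.
    Returned boolean array: True = token kept, False = masked."""
    n = grid_shape[axis]
    keep = max(1, int(round(n * keep_ratio)))
    kept = rng.choice(n, size=keep, replace=False)
    mask = np.zeros(grid_shape, dtype=bool)
    idx = [slice(None)] * 3
    idx[axis] = kept
    mask[tuple(idx)] = True
    return mask

def axis_wise_mask(grid_shape, axis, keep_ratio, rng):
    """Keep a random keep_ratio fraction of whole token lines
    running along `axis` (each line spans that full axis)."""
    mask = np.zeros(grid_shape, dtype=bool)
    # Lines are indexed by their position in the remaining two dims.
    plane_shape = tuple(s for i, s in enumerate(grid_shape) if i != axis)
    n_lines = plane_shape[0] * plane_shape[1]
    keep = max(1, int(round(n_lines * keep_ratio)))
    kept_2d = np.unravel_index(
        rng.choice(n_lines, size=keep, replace=False), plane_shape)
    idx = [None, None, None]
    other = [i for i in range(3) if i != axis]
    idx[axis] = slice(None)
    idx[other[0]], idx[other[1]] = kept_2d
    mask[tuple(idx)] = True
    return mask
```

Applying both masks in parallel branches, as the summary describes, would give the encoder two complementary anisotropic views of the same superpatch.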

Abstract

Volumetric CT imaging is essential for clinical diagnosis, yet annotating 3D volumes is expensive and time-consuming, motivating self-supervised learning (SSL) from unlabeled data. However, applying SSL to 3D CT remains challenging due to the high memory cost of full-volume transformers and the anisotropic spatial structure of CT data, which is not well captured by conventional masking strategies. We propose NEMESIS, a masked autoencoder (MAE) framework that operates on local 128×128×128 superpatches, enabling memory-efficient training while preserving anatomical detail. NEMESIS introduces three key components: (i) noise-enhanced reconstruction as a pretext task, (ii) Masked Anatomical Transformer Blocks (MATB) that perform dual-masking through parallel plane-wise and axis-wise token removal, and (iii) NEMESIS Tokens (NT) for cross-scale context aggregation. On the BTCV multi-organ classification benchmark, NEMESIS with a frozen backbone and a linear classifier achieves a mean AUROC of 0.9633, surpassing fully fine-tuned SuPreM (0.9493) and VoCo (0.9387). Under a low-label regime with only 10% of available annotations, it retains an AUROC of 0.9075, demonstrating strong label efficiency. Furthermore, the superpatch-based design reduces computational cost to 31.0 GFLOPs per forward pass, compared to 985.8 GFLOPs for the full-volume baseline, providing a scalable and robust foundation for 3D medical imaging.
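The memory savings come from training on local 128×128×128 superpatches rather than the full volume. A minimal sketch of such a sampler, assuming the volume is at least 128 voxels along each axis (the function name and random-crop policy are assumptions, not the paper's stated procedure):

```python
import numpy as np

def sample_superpatch(volume, size=128, rng=None):
    """Randomly crop one size**3 superpatch from a 3D CT volume
    shaped (D, H, W). Hypothetical sketch: assumes every dimension
    of `volume` is >= size, so the crop is always full-sized."""
    if rng is None:
        rng = np.random.default_rng()
    # Random top-left-front corner, inclusive of the last valid start.
    starts = [int(rng.integers(0, s - size + 1)) for s in volume.shape]
    return volume[tuple(slice(s, s + size) for s in starts)]
```

Each training step then runs the MAE only on 128³ = ~2.1M voxels instead of the whole scan, which is consistent with the reported drop from 985.8 to 31.0 GFLOPs per forward pass.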