GPAFormer: Graph-guided Patch Aggregation Transformer for Efficient 3D Medical Image Segmentation

arXiv cs.CV / 4/9/2026


Key Points

  • The paper introduces GPAFormer, a lightweight transformer-based architecture aimed at efficient and accurate 3D medical image segmentation across multiple modalities and organs.
  • GPAFormer’s design centers on two modules: MASA (multi-scale attention-guided stacked aggregation) for handling structures at different sizes, and MPGA (mutual-aware patch graph aggregator) for graph-guided aggregation using patch feature similarity and spatial adjacency.
  • Experiments on the public whole-body CT/MRI datasets BTCV, Synapse, ACDC, and BraTS report the overall highest segmentation performance among compared networks while using only 1.81M parameters.
  • Reported DSC scores include 75.70% on BTCV, 81.20% on Synapse, 89.32% on ACDC, and 82.74% on BraTS, indicating a strong balance between accuracy and compactness.
  • The method is presented as practical for real settings, with sub-second inference on a consumer GPU for a validation case in BTCV, targeting resource-constrained clinical environments.
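The multi-scale idea behind MASA (parallel paths with different receptive fields processed over the same input and then aggregated) can be sketched in miniature. The 1D moving-average stand-in below is a hypothetical simplification, not the paper's 3D attention-guided module; function names and window sizes are assumptions:

```python
# Minimal 1D sketch of multi-scale parallel paths (hypothetical simplification):
# three moving-average "paths" with different window sizes stand in for
# convolutions with different receptive fields; their outputs are averaged
# per position. The actual MASA operates on 3D feature maps with
# attention-guided stacked aggregation.

def moving_average(signal, window):
    """Average over a centered window, clipped at the signal borders."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def multi_scale_aggregate(signal, windows=(1, 3, 5)):
    """Run parallel paths with different receptive fields and average them."""
    paths = [moving_average(signal, w) for w in windows]
    return [sum(vals) / len(paths) for vals in zip(*paths)]

features = [0.0, 0.0, 1.0, 0.0, 0.0]  # a single "edge" activation
print(multi_scale_aggregate(features))
```

The fine path (window 1) preserves the sharp activation while the coarse paths spread context from neighbouring positions, so the aggregate responds to both small and large structures.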

Abstract

Deep learning has been widely applied to 3D medical image segmentation. However, due to the diversity of imaging modalities, the high-dimensional nature of the data, and the heterogeneity of anatomical structures, achieving both segmentation accuracy and computational efficiency in multi-organ segmentation remains challenging. This study proposes GPAFormer, a lightweight network architecture designed for 3D medical image segmentation that emphasizes efficiency while maintaining high accuracy. GPAFormer incorporates two core modules: the multi-scale attention-guided stacked aggregation (MASA) and the mutual-aware patch graph aggregator (MPGA). MASA uses three parallel paths with different receptive fields, combined through planar aggregation, to strengthen the network's capability to handle structures of varying sizes. MPGA employs a graph-guided approach that dynamically aggregates regions with similar feature distributions based on inter-patch feature similarity and spatial adjacency, thereby improving the discrimination of both the internal and boundary structures of organs. Experiments were performed on the public whole-body CT and MRI datasets BTCV, Synapse, ACDC, and BraTS. Compared with existing 3D segmentation networks, GPAFormer, using only 1.81 M parameters, achieved the overall highest DSC on BTCV (75.70%), Synapse (81.20%), ACDC (89.32%), and BraTS (82.74%). On a consumer-level GPU, inference for one BTCV validation case took less than one second. These results demonstrate that GPAFormer balances accuracy and efficiency in multi-organ, multi-modality 3D segmentation across clinical scenarios, especially in resource-constrained and time-sensitive environments.
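The abstract's description of MPGA, dynamic aggregation of patches based on feature similarity and spatial adjacency, can be illustrated with a minimal sketch. All names, the adjacency rule, and the normalization below are assumptions for illustration, not the paper's implementation:

```python
# Hedged sketch of graph-guided patch aggregation (not the paper's exact MPGA):
# edge weights combine inter-patch cosine similarity with a spatial adjacency
# gate, and each patch's features are re-estimated as a weighted average of
# its neighbours. Patch layout, adjacency radius, and normalization are
# illustrative assumptions.
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def patch_graph_aggregate(patches, positions, radius=1.0):
    """Aggregate each patch with spatially adjacent, feature-similar patches."""
    n = len(patches)
    out = []
    for i in range(n):
        # Edge weight: similarity, gated by spatial adjacency (distance <= radius).
        weights = []
        for j in range(n):
            dist = math.dist(positions[i], positions[j])
            sim = cosine(patches[i], patches[j])
            weights.append(max(sim, 0.0) if dist <= radius else 0.0)
        total = sum(weights) or 1.0
        out.append([
            sum(w * patches[j][k] for j, w in enumerate(weights)) / total
            for k in range(len(patches[i]))
        ])
    return out

# Three patches on a line: the first two share features, the third differs.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
positions = [(0.0,), (1.0,), (2.0,)]
print(patch_graph_aggregate(patches, positions))
```

In this toy case the two similar patches reinforce each other, while the dissimilar neighbour contributes a zero-weight edge and keeps its own features, which is the intuition behind sharpening organ boundaries while smoothing interiors.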