Shaken or Stirred? An Analysis of MetaFormer's Token Mixing for Medical Imaging

arXiv cs.CV · April 27, 2026

💬 Opinion · Models & Research

Key Points

  • The study offers a first comprehensive comparison of different token mixers (pooling-, convolution-, and attention-based) within the MetaFormer framework specifically for medical imaging tasks.
  • Experiments cover both image classification (global prediction) and semantic segmentation (dense prediction) across nine datasets — seven 2D and two 3D — spanning diverse medical imaging modalities.
  • For classification, the paper finds that low-complexity token mixers such as grouped convolutions or pooling can be sufficient, mirroring conclusions from natural-image settings.
  • For segmentation, convolutional token mixers’ local inductive bias proves essential, with grouped convolutions emerging as the preferred option due to lower runtime and fewer parameters.
  • The work also evaluates transferring pretrained weights from natural images and shows that such pretraining can still help in certain cases even when switching to a new token mixer introduces a domain gap.
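The MetaFormer recipe the key points refer to keeps the block structure fixed and only swaps the token mixer. The sketch below is my own simplification (not code from the paper): it omits learned normalization and uses a PoolFormer-style pooling mixer, purely to show how the mixer is an interchangeable component while the residual connections and channel-MLP stay the same.

```python
# Minimal pure-Python sketch of a MetaFormer-style block on a 1-D token
# sequence. Assumption: normalization layers are omitted for brevity.

def pool_mixer(tokens, window=3):
    """PoolFormer-style parameter-free mixer: local average of each token
    and its neighbors (implicit zero padding at the edges), minus the
    token itself (the identity is re-added by the residual branch)."""
    n, d = len(tokens), len(tokens[0])
    half = window // 2
    mixed = []
    for i in range(n):
        acc = [0.0] * d
        for j in range(max(0, i - half), min(n, i + half + 1)):
            for k in range(d):
                acc[k] += tokens[j][k]
        mixed.append([acc[k] / window - tokens[i][k] for k in range(d)])
    return mixed

def channel_mlp(tokens, w1, w2):
    """Per-token two-layer MLP with ReLU: mixes channels, not positions."""
    out = []
    for t in tokens:
        h = [max(0.0, sum(t[i] * w1[i][j] for i in range(len(t))))
             for j in range(len(w1[0]))]
        out.append([sum(h[i] * w2[i][j] for i in range(len(h)))
                    for j in range(len(w2[0]))])
    return out

def metaformer_block(tokens, w1, w2):
    """x = x + TokenMixer(x); x = x + ChannelMLP(x)   (norms omitted)."""
    x = [[a + b for a, b in zip(t, m)]
         for t, m in zip(tokens, pool_mixer(tokens))]
    return [[a + b for a, b in zip(t, m)]
            for t, m in zip(x, channel_mlp(x, w1, w2))]
```

Replacing `pool_mixer` with a (grouped) convolution or with self-attention yields the other variants the study compares; nothing else in the block changes.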

Abstract

The generalization of the Transformer architecture via MetaFormer has reshaped our understanding of its success in computer vision. By replacing self-attention with simpler token mixers, MetaFormer provides strong baselines for vision tasks. However, while extensively studied on natural image datasets, its use in medical imaging remains scarce, and existing works rarely compare different token mixers, potentially overlooking more suitable design choices. In this work, we present the first comprehensive study of token mixers for medical imaging. We systematically analyze pooling-, convolution-, and attention-based token mixers within the MetaFormer architecture on image classification (global prediction task) and semantic segmentation (dense prediction task). Our evaluation spans nine datasets (seven 2D and two 3D) covering diverse modalities and common challenges in the medical domain. Given the prevalence of pretraining from natural images to mitigate medical data scarcity, we also examine transferring pretrained weights to new token mixers. Our results show that, for classification, low-complexity token mixers (e.g. grouped convolution or pooling) are sufficient, aligning with findings on natural images. Pretrained weights remain useful in some settings despite the domain gap introduced by the new token mixer. For segmentation, we find that the local inductive bias of convolutional token mixers is essential. Grouped convolutions emerge as the preferred choice, as they reduce runtime and parameter count compared to standard convolutions, while the MetaFormer's channel-MLPs already provide the necessary cross-channel interactions.
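The parameter saving the abstract attributes to grouped convolutions follows from a simple weight count. The helper below is my own back-of-envelope illustration (bias terms ignored), not code from the paper:

```python
def conv2d_weight_count(c_in, c_out, k, groups=1):
    """Weight count of a k x k 2-D convolution: each of the c_out filters
    sees only c_in / groups input channels, so grouping divides the count
    by the number of groups (biases ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k

standard  = conv2d_weight_count(64, 64, 3)            # 64*64*9 = 36864
depthwise = conv2d_weight_count(64, 64, 3, groups=64)  # 64*1*9  = 576
```

At the extreme of one group per channel (a depthwise convolution), spatial mixing carries a 64x smaller weight budget here, and cross-channel interaction is left entirely to the channel-MLPs — the division of labor the abstract's final sentence describes.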