A Multi-Dataset Benchmark of Multiple Instance Learning for 3D Neuroimage Classification

arXiv cs.LG · April 30, 2026


Key Points

  • The paper evaluates multiple instance learning (MIL) methods against 3D CNNs and 3D Vision Transformers for CT/MRI 3D neuroimage classification, using three CT and four MRI datasets (including two with 10,000+ scans).
  • It focuses on efficient deep MIL settings where the 2D image encoder can be frozen, training only the pooling mechanism and the classifier, aiming to help resource-constrained practitioners choose effective architectures.
  • Results show that simple mean-pooling MIL—without learnable attention—matches or outperforms more complex MIL variants and 3D CNN alternatives on 4 out of 6 moderate-sized tasks.
  • On the two large datasets, the mean-pooling baseline stays competitive while reportedly being up to 25× faster to train, indicating substantial practical efficiency gains.
  • The authors analyze why mean pooling works (including per-slice attention quality) and use a semi-synthetic dataset with Bayes-optimal estimates to identify limitations of current MIL approaches and suggest directions for future improvements.
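The frozen-encoder MIL setup described above can be sketched minimally: a pre-trained 2D encoder embeds each slice with gradients disabled, the slice embeddings are averaged (mean pooling, no learnable attention), and only a linear classifier is trained. This is an illustrative NumPy sketch under stated assumptions; the `frozen_encoder` stand-in and all dimensions are hypothetical, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen 2D encoder: in practice this would be a
# pre-trained CNN/ViT applied slice-by-slice with gradients disabled. Here it
# just maps each of the S slices to a random D-dimensional embedding.
def frozen_encoder(scan):                    # scan: (S, H, W)
    num_slices = scan.shape[0]
    return rng.standard_normal((num_slices, 64))   # (S, D) slice embeddings

def mean_pool_mil(scan, W, b):
    """Mean-pooling MIL head: average slice embeddings, then classify.

    Only W and b would be trained; the encoder stays frozen.
    """
    z = frozen_encoder(scan)                 # (S, D)
    bag = z.mean(axis=0)                     # (D,) permutation-invariant pooling
    logits = bag @ W + b                     # (num_classes,)
    return logits

scan = rng.standard_normal((40, 128, 128))   # e.g. 40 axial slices
W = rng.standard_normal((64, 2)) * 0.01      # trainable classifier weights
b = np.zeros(2)
logits = mean_pool_mil(scan, W, b)
print(logits.shape)                          # (2,)
```

Because the pooled bag embedding is a simple average, the classifier is invariant to slice order, and training touches only the small `W` and `b`, which is the source of the efficiency gains the paper reports.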

Abstract

Despite being resource-intensive to train, 3D convolutional neural networks (CNNs) have been the standard approach to classify CT and MRI scans. Recent work suggests that deep multiple instance learning (MIL) may be a more efficient alternative for 3D brain scans, especially when the pre-trained image encoder used to embed each 2D slice is frozen and only the pooling operation and classifier are trained. In this paper, we provide a systematic comparison of simple MIL, attention-based MIL, 3D CNNs, and 3D ViTs across three CT and four MRI datasets, including two large datasets of at least 10,000 scans. Our goal is to help resource-constrained practitioners understand which neural networks work well for 3D neuroimages and why. We further compare design choices for attention-based MIL, including different encoders, pooling operations, and architectural orderings. We find that simple mean-pooling MIL, without any learnable attention, matches or outperforms recent MIL or 3D CNN alternatives on 4 of 6 moderate-sized tasks. This baseline remains competitive on two large datasets while being 25× faster to train. To explain mean pooling's success, we examine per-slice attention quality and a semi-synthetic dataset where we can derive the best possible classifier via a Bayes estimator. This analysis reveals the limits of existing MIL approaches and suggests routes for future improvements.
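The attention-based MIL variants the paper benchmarks replace the plain average with a learned weighting over slices. A common formulation (the gated/tanh attention of Ilse et al., which may differ from the exact variants evaluated here) scores each slice embedding, normalizes the scores with a softmax, and takes the weighted sum. A minimal sketch, with all parameter shapes (`V`, `w`) chosen for illustration:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(z, V, w):
    """Attention-based MIL pooling over slice embeddings.

    z: (S, D) slice embeddings from a frozen encoder
    V: (D, H) and w: (H,) are the learnable attention parameters.
    Returns the (D,) bag embedding and the (S,) per-slice weights.
    """
    scores = np.tanh(z @ V) @ w      # (S,) per-slice attention logits
    a = softmax(scores)              # (S,) weights, nonnegative, sum to 1
    return a @ z, a                  # weighted sum replaces the plain mean

rng = np.random.default_rng(1)
z = rng.standard_normal((40, 64))    # 40 slices, 64-dim embeddings
V = rng.standard_normal((64, 32)) * 0.1
w = rng.standard_normal(32) * 0.1
bag, a = attention_pool(z, V, w)
print(bag.shape, float(a.sum()))     # (64,) 1.0
```

Note that mean pooling is the special case where all attention weights equal 1/S; the paper's finding is that on most moderate-sized tasks this fixed uniform weighting matches or beats the learned one.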
