MambaBack: Bridging Local Features and Global Contexts in Whole Slide Image Analysis

arXiv cs.CV / 4/20/2026


Key Points

  • The paper introduces MambaBack, a hybrid Multiple Instance Learning (MIL) architecture for Whole Slide Image (WSI) analysis that combines local cellular feature extraction with global context modeling.
  • It addresses key limitations of existing Mamba-based MIL approaches, including loss of 2D spatial locality from 1D flattening, weak modeling of fine-grained local structures, and high inference memory peaks on edge devices.
  • MambaBack preserves 2D tile relationships via a Hilbert sampling strategy, improving spatial perception in the resulting 1D sequence representation.
  • The model uses a hierarchical design with a 1D Gated CNN block (inspired by MambaOut) for local features and a BiMamba2 block for global context aggregation across multiple scales.
  • An asymmetric chunking mechanism enables parallel processing during training and chunking-streaming accumulation during inference, reducing peak memory usage on resource-constrained devices.
  • Experiments on five datasets show that MambaBack outperforms seven state-of-the-art methods; the source code and datasets are publicly released.
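The Hilbert sampling idea in the key points can be illustrated with the classic Hilbert-curve index: tiles on a 2^k × 2^k grid are ordered so that consecutive entries of the resulting 1D sequence remain spatially adjacent in 2D, unlike a row-major raster scan that jumps at every row boundary. This is an illustrative sketch, not the paper's implementation; the function names `hilbert_index` and `hilbert_order` are hypothetical.

```python
def hilbert_index(n, x, y):
    """Map a tile coordinate (x, y) on an n x n grid (n a power of two)
    to its position along the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so each sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(tiles, n):
    """Sort (x, y) tile coordinates into Hilbert-curve order, yielding a
    1D sequence whose neighbouring elements are also 2D neighbours."""
    return sorted(tiles, key=lambda t: hilbert_index(n, t[0], t[1]))


# Order all tiles of a 4 x 4 grid: every consecutive pair in the
# resulting sequence has Manhattan distance 1.
grid = [(x, y) for x in range(4) for y in range(4)]
seq = hilbert_order(grid, 4)
```

Preserving this 2D adjacency in the flattened sequence is what lets a 1D sequence model (here, the Mamba-style backbone) still perceive local tile neighbourhoods.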

Abstract

Whole Slide Image (WSI) analysis is pivotal in computational pathology, enabling cancer diagnosis by integrating morphological and architectural cues across magnifications. Multiple Instance Learning (MIL) serves as the standard framework for WSI analysis. Recently, Mamba has become a promising backbone for MIL, overtaking Transformers thanks to its efficiency and the global context modeling capability it inherits from Natural Language Processing (NLP). However, existing Mamba-based MIL approaches face three critical challenges: (1) disruption of 2D spatial locality during 1D sequence flattening; (2) sub-optimal modeling of fine-grained local cellular structures; and (3) high memory peaks during inference on resource-constrained edge devices. Studies such as MambaOut reveal that Mamba's SSM component is redundant for local feature extraction, where Gated CNNs suffice. Recognizing that WSI analysis demands both fine-grained local feature extraction, akin to natural images, and global context modeling, akin to NLP, we propose MambaBack, a novel hybrid architecture that harmonizes the strengths of Mamba and MambaOut. First, we propose the Hilbert sampling strategy to preserve the 2D spatial locality of tiles within 1D sequences, enhancing the model's spatial perception. Second, we design a hierarchical structure comprising a 1D Gated CNN block based on MambaOut to capture local cellular features, and a BiMamba2 block to aggregate global context, jointly enhancing multi-scale representation. Finally, we implement an asymmetric chunking design, allowing parallel processing during training and chunking-streaming accumulation during inference, minimizing peak memory usage for deployment. Experimental results on five datasets demonstrate that MambaBack outperforms seven state-of-the-art methods. Source code and datasets are publicly available.
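The chunking-streaming inference idea described above, processing a long instance sequence chunk by chunk while keeping only a small running accumulator live in memory, can be sketched as follows. This is a simplified illustration in which a streaming mean accumulator stands in for the model's actual recurrent state; the function name, chunk size, and aggregation rule are assumptions, not the paper's BiMamba2 mechanics.

```python
def streaming_aggregate(features, chunk_size):
    """Aggregate a bag of instance feature vectors chunk by chunk, so peak
    memory is bounded by chunk_size rather than the full bag length."""
    dim = len(features[0])
    running_sum = [0.0] * dim  # the only state carried across chunks
    count = 0
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]  # only this chunk is "live"
        for vec in chunk:
            for j in range(dim):
                running_sum[j] += vec[j]
        count += len(chunk)
    return [s / count for s in running_sum]


# A toy bag of 10 instance feature vectors (dim 4): streaming over chunks
# of 3 reproduces the full-batch mean exactly.
bag = [[float(i + j) for j in range(4)] for i in range(10)]
streamed = streaming_aggregate(bag, chunk_size=3)
full_batch = [sum(col) / len(bag) for col in zip(*bag)]
```

Because the accumulator has fixed size, peak memory during inference scales with the chunk length instead of the number of tiles in the slide, which is what makes deployment on edge devices feasible for gigapixel WSIs.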