CNN-ViT Fusion with Adaptive Attention Gate for Brain Tumor MRI Classification: A Hybrid Deep Learning Model

arXiv cs.CV / 4/28/2026

Key Points

  • The paper proposes a new hybrid deep learning model for brain tumor MRI classification that fuses CNN and Vision Transformer (ViT) representations for better local-and-global feature learning.
  • It introduces an Adaptive Attention Gate that learns dynamic, per-sample and per-feature weighting to contextually merge the CNN (local texture/spatial) and transformer (long-range dependencies) branches.
  • The model is evaluated on the Kaggle Brain Tumor MRI Dataset and reports strong performance: 97.60% test accuracy, 97.30% precision, 97.50% recall, 97.40% F1-score, and 0.9946 macro-average AUC.
  • The authors state that the results outperform single CNN/ViT baselines and existing competitive fusion approaches, suggesting dynamic feature weighting improves medical image classification.
  • The work is shared as an arXiv preprint (v1), indicating an early-stage research contribution rather than a finalized, clinically validated system.
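
The reported metrics can be made concrete with a small sketch. The summary only states macro averaging for the AUC, so applying it to precision, recall, and F1 here is an assumption, as are the function name `macro_metrics` and the toy labels; none of this comes from the paper. Macro averaging computes each metric per class and then takes the unweighted mean, so every class counts equally regardless of how many images it has.

```python
def macro_metrics(y_true, y_pred, num_classes):
    """Accuracy plus macro-averaged precision, recall, and F1.

    Hypothetical helper for illustration; labels are integers in
    range(num_classes).
    """
    # confusion[t][p] counts samples with true class t predicted as p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        confusion[t][p] += 1

    precisions, recalls, f1s = [], [], []
    for k in range(num_classes):
        tp = confusion[k][k]
        predicted_k = sum(confusion[t][k] for t in range(num_classes))
        actual_k = sum(confusion[k])
        # Guard against classes that were never predicted or never present.
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    correct = sum(confusion[k][k] for k in range(num_classes))
    return {
        "accuracy": correct / len(y_true),
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,
    }


# Toy labels only; these are not results from the paper.
toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]
scores = macro_metrics(toy_true, toy_pred, num_classes=2)
```

Macro F1 is taken here as the mean of the per-class F1 values, which is one common convention; the paper may define it differently.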

Abstract

Early detection and classification of brain tumors from Magnetic Resonance Imaging (MRI) scans is highly important, but the discriminative features are difficult to extract from medical images. Convolutional Neural Networks (CNNs) are good at capturing local texture and spatial information, whereas Vision Transformers (ViTs) are good at capturing long-range global dependencies. In this paper, we propose a new hybrid architecture that combines a SqueezeNet-style CNN branch with a MobileViT-style global transformer branch through an Adaptive Attention Gate mechanism. The gate learns dynamic per-sample, per-feature weights that set the contribution of each branch, allowing context-sensitive merging of local and global representations. Trained and evaluated on the Brain Tumor MRI Dataset (Kaggle), the proposed model reaches a test accuracy of 97.60%, a precision of 97.30%, a recall of 97.50%, an F1-score of 97.40%, and a macro-average area under the curve (AUC) of 0.9946. These scores are higher than those of single CNN and ViT baselines and of current competitive fusion methods, showing that dynamic feature weighting is an effective approach to medical image classification.
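
To illustrate what per-sample, per-feature gating can look like, here is a minimal dependency-free sketch. It is not the authors' implementation: the abstract does not give the gate's formula, so the sigmoid-over-concatenation form, the convex combination `g * cnn + (1 - g) * vit`, and all names (`adaptive_gate_fusion`, `weights`, `bias`) are assumptions. In a real model the weights would be learned jointly with both branches inside a deep learning framework.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_gate_fusion(cnn_feat, vit_feat, weights, bias):
    """Fuse two equal-length feature vectors with a per-feature gate.

    Assumed form (not taken from the paper):
        gate  = sigmoid(W @ concat(cnn, vit) + b)   # one value per feature
        fused = gate * cnn + (1 - gate) * vit
    weights has d rows of length 2d; bias has length d.
    """
    joint = cnn_feat + vit_feat  # list concatenation, length 2d
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, joint)) + b)
        for row, b in zip(weights, bias)
    ]
    fused = [g * c + (1.0 - g) * v for g, c, v in zip(gate, cnn_feat, vit_feat)]
    return fused, gate


# Toy example: random, untrained gate parameters and made-up features.
random.seed(0)
d = 4
weights = [[random.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(d)]
bias = [0.0] * d
cnn_feat = [0.9, 0.1, 0.4, 0.7]  # stand-in for pooled CNN-branch features
vit_feat = [0.2, 0.8, 0.5, 0.3]  # stand-in for pooled transformer features
fused, gate = adaptive_gate_fusion(cnn_feat, vit_feat, weights, bias)
```

Because the gate is computed from each sample's own features, two different images get different weightings, and each feature dimension gets its own weight. Under this assumed form every fused value lies between the corresponding CNN and transformer values, which is one simple way to realise the context-sensitive merging the abstract describes.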