ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence Modelling for Hyperspectral Image Classification

arXiv cs.CV / 4/22/2026


Key Points

  • Hyperspectral image (HSI) classification is difficult because of high spectral dimensionality, redundancy, and limited labeled data, motivating more efficient yet accurate sequence/spatial modeling.
  • The paper proposes ConvVitMamba, a hybrid architecture combining multiscale convolution for local spectral-spatial patterns, a Vision Transformer tokenization/encoding stage for global context, and a lightweight Mamba-inspired gated sequence-mixing module to avoid costly quadratic self-attention.
  • Principal Component Analysis (PCA) preprocessing is used to reduce spectral redundancy and improve overall efficiency.
  • Experiments on four benchmarks (including Houston and three UAV-based QUH datasets) show ConvVitMamba consistently outperforming CNN-, Transformer-, and Mamba-based approaches while keeping a favorable accuracy–model size–inference trade-off.
  • Ablation studies validate that each of the three components contributes complementarily to the final performance, and the authors release the source code publicly.
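The PCA preprocessing step mentioned above is standard for HSI pipelines: each pixel's spectrum is treated as one sample and projected onto the top principal components, shrinking the band dimension before the network sees the data. The sketch below is a minimal, generic version of that idea using plain numpy SVD; it is not the authors' exact pipeline, and the function name and component count are illustrative.

```python
import numpy as np

def pca_reduce(cube, n_components=30):
    """Reduce the spectral dimension of an HSI cube (H, W, B) to
    n_components via PCA, treating each pixel spectrum as one sample.

    Hypothetical helper for illustration; the paper's preprocessing
    details (component count, centering choices) may differ.
    """
    h, w, b = cube.shape
    x = cube.reshape(-1, b).astype(np.float64)
    x -= x.mean(axis=0)                       # center each spectral band
    # Principal directions via SVD of the centered pixel-by-band matrix
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    reduced = x @ vt[:n_components].T         # project onto top components
    return reduced.reshape(h, w, n_components)

# Example: a synthetic 8x8 scene with 100 spectral bands -> 30 components
cube = np.random.default_rng(0).normal(size=(8, 8, 100))
out = pca_reduce(cube, n_components=30)
print(out.shape)  # (8, 8, 30)
```

Reducing, say, 100+ raw bands to a few dozen components is what keeps the downstream convolution and sequence-mixing stages small, which is the efficiency argument the paper makes.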

Abstract

Hyperspectral image (HSI) classification remains challenging due to high spectral dimensionality, redundancy, and limited labeled data. Although convolutional neural networks (CNNs) and Vision Transformers (ViTs) achieve strong performance by exploiting spectral-spatial information and long-range dependencies, they often incur high computational cost and large model size, limiting practical use. To address these limitations, a unified hybrid framework, termed ConvVitMamba, is proposed for efficient HSI classification. The architecture integrates three components: a multiscale convolutional feature extractor to capture local spectral, spatial, and joint patterns; a Vision Transformer-based tokenization and encoding stage to model global contextual relationships; and a lightweight Mamba-inspired gated sequence-mixing module for efficient content-aware refinement without quadratic self-attention. Principal Component Analysis (PCA) is used as preprocessing to reduce redundancy and improve efficiency. Experiments on four benchmark datasets, including Houston and three UAV-borne QUH datasets (Pingan, Qingyun, and Tangdaowan), demonstrate that ConvVitMamba consistently outperforms CNN-, Transformer-, and Mamba-based methods while maintaining a favorable balance between accuracy, model size, and inference efficiency. Ablation studies confirm the complementary contributions of all components. The results indicate that the proposed framework provides an effective and efficient solution for HSI classification in diverse scenarios. The source code is publicly available at https://github.com/mqalkhatib/ConvVitMamba
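To make the "content-aware refinement without quadratic self-attention" claim concrete, here is a minimal numpy sketch of a gated sequence-mixing step in the spirit of Mamba-style modules: a causal linear-time mixing pass over the token sequence, blended with the input by a content-dependent gate. All names, the gate parameterization, and the cumulative-mean mixer are assumptions for illustration, not the paper's actual module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_sequence_mix(tokens, rng):
    """Content-aware gated mixing over a token sequence of shape (L, D).

    A causal cumulative (prefix) mean spreads information along the
    sequence in O(L*D) time, and a gate computed from the tokens decides,
    per feature, how much mixed context to blend in -- in contrast to
    O(L^2) self-attention. Hypothetical sketch, not the paper's module.
    """
    L, D = tokens.shape
    w_gate = rng.normal(scale=0.1, size=(D, D))    # stand-in gate weights
    # Linear-time causal mixing: running mean of tokens up to position t
    mixed = np.cumsum(tokens, axis=0) / np.arange(1, L + 1)[:, None]
    gate = sigmoid(tokens @ w_gate)                # content-dependent gate
    return gate * mixed + (1.0 - gate) * tokens    # gated blend

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 64))   # e.g. 16 patch tokens, 64-dim each
out = gated_sequence_mix(tokens, rng)
print(out.shape)  # (16, 64)
```

The point of the sketch is the cost profile: both the mixing and the gating are linear in sequence length, which is why such modules scale better than self-attention when the token sequence (patches times PCA components) grows.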