DenseSwinV2: Channel Attentive Dual Branch CNN Transformer Learning for Cassava Leaf Disease Classification

arXiv cs.AI / 3/30/2026


Key Points

  • The paper introduces DenseSwinV2, a hybrid two-branch CNN–Transformer framework that combines DenseNet-style dense local feature learning with customized Swin Transformer V2 global contextual modeling for cassava leaf disease classification.
  • It uses shifted-window self-attention to capture long-range dependencies that help distinguish visually similar lesions, addressing challenges like occlusion, noise, and complex backgrounds.
  • Independent channel-squeeze attention modules are applied to each stream to amplify discriminative disease-related responses while suppressing redundant or background activations.
  • On a public cassava dataset with 31,000 images across five conditions (including normal), DenseSwinV2 reports 98.02% classification accuracy and an F1 score of 97.81%, outperforming established CNN and transformer baselines.
  • The results suggest the approach is robust and practical for field-level agricultural diagnosis where image quality is variable.
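The channel-squeeze attention described in the third point can be illustrated with a minimal squeeze-and-excitation-style sketch. The paper does not specify the exact module design, so the bottleneck MLP, reduction ratio `r`, and weight shapes below are assumptions; the sketch only shows the squeeze–excite–rescale pattern of amplifying informative channels and damping the rest.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_squeeze_attention(feat, w1, w2):
    """SE-style channel attention (hypothetical stand-in for the
    paper's channel-squeeze module).
    feat: (C, H, W) feature map; w1: (C//r, C); w2: (C, C//r)."""
    # Squeeze: global-average-pool each channel to one descriptor
    z = feat.mean(axis=(1, 2))               # (C,)
    # Excite: bottleneck MLP, ReLU then sigmoid gate in [0, 1]
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0))  # (C,)
    # Rescale: emphasize discriminative channels, suppress the rest
    return feat * s[:, None, None]

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
out = channel_squeeze_attention(feat, w1, w2)
```

Because the gate is a per-channel sigmoid, the module can only scale responses down toward zero or leave them near their original magnitude, which is what lets it suppress background-driven activations without altering spatial layout.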

Abstract

This work presents Hybrid Dense SwinV2, a two-branch framework that jointly leverages densely connected convolutional features and hierarchical, customized Swin Transformer V2 (SwinV2) representations for cassava disease classification. The DenseNet branch captures high-resolution local features, preserving fine structural cues while allowing effective gradient flow. Concurrently, the customized SwinV2 branch models global contextual dependencies through shifted-window self-attention, capturing the long-range interactions critical for distinguishing visually similar lesions. Moreover, a channel-squeeze attention module is applied to each CNN and Transformer stream independently to emphasize discriminative disease-related responses and suppress redundant or background-driven activations. Finally, the attended channels from the two streams are fused into a refined representation that combines the strengths of the dense local and SwinV2 global feature maps. Dense SwinV2 is evaluated on a public cassava leaf disease dataset of 31,000 images spanning five classes: brown streak, mosaic, green mottle, bacterial blight, and healthy (normal) leaves. The model achieves 98.02% classification accuracy with an F1 score of 97.81%, outperforming well-established convolutional and transformer baselines. These results indicate that Hybrid Dense SwinV2 is robust and practical for field-level diagnosis of cassava disease under real-world challenges such as occlusion, noise, and complex backgrounds.
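The fusion step in the abstract can be sketched as follows. The paper does not detail how the two attended streams are combined, so this is a minimal hypothetical head under assumed choices: each branch's feature map is global-average-pooled, the two vectors are concatenated, and a linear layer maps the fused vector to the five classes.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_branch_head(local_feat, global_feat, w_cls):
    """Hypothetical fusion head for a two-branch CNN-Transformer.
    local_feat / global_feat: (C, H, W) maps from each branch;
    w_cls: (num_classes, C_local + C_global) classifier weights."""
    # Pool each branch's spatial map to a channel descriptor
    v_local = local_feat.mean(axis=(1, 2))
    v_global = global_feat.mean(axis=(1, 2))
    # Fuse by concatenation, then classify
    fused = np.concatenate([v_local, v_global])
    return softmax(w_cls @ fused)  # probabilities over the 5 classes

rng = np.random.default_rng(1)
local_feat = rng.standard_normal((16, 7, 7))   # DenseNet-style branch
global_feat = rng.standard_normal((32, 7, 7))  # SwinV2-style branch
w_cls = rng.standard_normal((5, 16 + 32))
probs = dual_branch_head(local_feat, global_feat, w_cls)
```

Concatenation keeps the local and global representations disentangled until the final classifier, letting the linear layer learn which branch to trust per class; alternatives such as element-wise addition or cross-attention fusion would require the branches to share a channel dimension.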