Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection

arXiv cs.CV / 4/10/2026


Key Points

  • The paper proposes MDDCNet, a Mamba-based model for multi-scale traffic object detection that targets difficulties with small objects in cluttered scenes.
  • It enhances state-space modeling by combining hierarchical multi-scale deformable dilated convolution (MSDDC) blocks with Mamba blocks to better capture both local details and global semantics.
  • A Channel-Enhanced Feed-Forward Network (CE-FFN) is introduced to improve channel interactions, addressing limitations of conventional FFNs.
  • A Mamba-based Attention-Aggregating Feature Pyramid Network (A^2FPN) is used to strengthen cross-scale feature fusion and interaction.
  • Experiments on public benchmarks and real-world datasets report that MDDCNet outperforms multiple advanced detectors, and the authors provide code on GitHub.
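
To make the "multi-scale dilated" idea in the points above concrete, here is a minimal pure-Python sketch of a 1-D dilated convolution evaluated at several dilation rates. It is an illustration only, not the authors' MSDDC block: the function names, the kernel, and the dilation rates are assumptions chosen for the example, and the deformable part (learned sampling offsets) is omitted entirely.

```python
def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by one output of a dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D cross-correlation whose kernel taps sit `dilation` samples apart."""
    span = receptive_field(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        acc = 0.0
        for k, weight in enumerate(kernel):
            # Tap k reads the input at a stride of `dilation` from the window start.
            acc += weight * signal[start + k * dilation]
        out.append(acc)
    return out


def multi_scale_responses(signal, kernel, dilations=(1, 2, 3)):
    """Apply the same kernel at several dilation rates (hypothetical helper)."""
    return {d: dilated_conv1d(signal, kernel, d) for d in dilations}


if __name__ == "__main__":
    ramp = [0, 1, 2, 3, 4, 5, 6]
    edge_kernel = [1, 0, -1]  # simple difference filter, chosen for illustration
    for d, response in multi_scale_responses(ramp, edge_kernel).items():
        print(d, receptive_field(len(edge_kernel), d), response)
```

With the same three-tap kernel, a larger dilation covers a wider stretch of the input without adding weights, which is the general reason dilated convolutions are used to gather context at several scales.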

Abstract

In real-world traffic scenes, objects of varying scale are usually scattered across cluttered backgrounds, which makes accurate detection challenging. Although current Mamba-based methods can efficiently model long-range dependencies, they still struggle to capture small objects with abundant local details, which hinders joint modeling of local structures and global semantics. Moreover, state-space models exhibit limited hierarchical feature representation and weak cross-scale interaction due to flat sequential modeling and insufficient spatial inductive biases, leading to sub-optimal performance in complex scenes. To address these issues, we propose the Mamba with Deformable Dilated Convolutions Network (MDDCNet) for accurate traffic object detection. In MDDCNet, a hybrid backbone with successive Multi-Scale Deformable Dilated Convolution (MSDDC) blocks and Mamba blocks enables hierarchical feature representation from local details to global semantics. In addition, a Channel-Enhanced Feed-Forward Network (CE-FFN) is devised to overcome the limited channel interaction capability of conventional feed-forward networks, and a Mamba-based Attention-Aggregating Feature Pyramid Network (A^2FPN) is constructed to achieve enhanced multi-scale feature fusion and interaction. Extensive experimental results on public benchmarks and real-world datasets demonstrate the superiority of our method over various advanced detectors. The code is available at https://github.com/Bettermea/MDDCNet.
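
One way to read the abstract's remark about "flat sequential modeling" is sketched below. A 2-D feature map is flattened row by row and fed through a fixed scalar linear state-space recurrence, h_t = a * h_{t-1} + b * x_t with y_t = c * h_t. The parameter values, grid size, and function names are assumptions for illustration. Practical Mamba blocks use input-dependent parameters and vector-valued states, so this is a simplified sketch and not the model described in the paper.

```python
def ssm_scan(inputs, a=0.5, b=1.0, c=1.0):
    """Run a scalar linear state-space recurrence over a 1-D sequence."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x      # state update
        outputs.append(c * h)  # readout
    return outputs


def flatten_row_major(grid):
    """Turn a 2-D grid (list of rows) into a single 1-D sequence."""
    return [value for row in grid for value in row]


def flat_index(row, col, width):
    """Position of pixel (row, col) in the row-major sequence."""
    return row * width + col


if __name__ == "__main__":
    width = 4
    # 4x4 grid with a single impulse at the top-left pixel.
    grid = [[0.0] * width for _ in range(4)]
    grid[0][0] = 1.0

    response = ssm_scan(flatten_row_major(grid))
    right = response[flat_index(0, 1, width)]  # horizontal neighbour, 1 step away
    below = response[flat_index(1, 0, width)]  # vertical neighbour, `width` steps away
    print(right, below)
```

In this toy setting the horizontal neighbour of the impulse sits one step away in the sequence, while the vertical neighbour sits a full row width away, so the impulse's contribution there has decayed by a factor of a to the power of the width. Pixels that are adjacent in the image can therefore be far apart in the scan, which is one motivation for pairing sequence models with convolutional blocks that operate directly on the 2-D neighbourhood.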