HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet

arXiv cs.CV / 4/17/2026

Key Points

The paper introduces HAMSA, a scanning-free Vision State Space Model that operates directly in the spectral (frequency) domain to avoid the architectural complexity and overhead of 2D-to-sequence scanning strategies used by existing SSMs like Vim, VMamba, and SiMBA.
HAMSA simplifies kernel parameterization by using a single Gaussian-initialized complex kernel instead of the traditional (A, B, C) matrix setup, aiming to remove discretization instabilities.
It proposes SpectralPulseNet (SPN), an input-dependent frequency gating mechanism for adaptive spectral modulation, and a Spectral Adaptive Gating Unit (SAGU) that uses magnitude-based gating to stabilize gradient flow in the frequency domain.
Using FFT-based convolution, HAMSA eliminates sequential scanning and achieves O(L log L) complexity. The authors report 85.7% top-1 accuracy on ImageNet-1K, faster inference than both transformer baselines and scanning-based SSMs, lower memory and energy use, and strong generalization on transfer learning and dense prediction tasks.
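The O(L log L) figure follows from the convolution theorem: a length-L circular convolution becomes an elementwise product after an FFT. The NumPy sketch below checks that equivalence on random data. It is a generic illustration, not code from the paper, and the function names and the length 64 are assumptions chosen for the example.

```python
import numpy as np

def direct_circular_conv(x, k):
    """Reference O(L^2) circular convolution, written out term by term."""
    L = len(x)
    out = np.zeros(L)
    for n in range(L):
        for m in range(L):
            out[n] += k[m] * x[(n - m) % L]
    return out

def fft_circular_conv(x, k):
    """Same convolution in O(L log L): multiply the spectra, then invert."""
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

rng = np.random.default_rng(0)
x = rng.standard_normal(64)   # hypothetical flattened input of length L = 64
k = rng.standard_normal(64)   # hypothetical global kernel of the same length

y_direct = direct_circular_conv(x, k)
y_fft = fft_circular_conv(x, k)
max_err = float(np.max(np.abs(y_direct - y_fft)))
print(f"max abs difference: {max_err:.2e}")
```

The two results agree up to floating-point rounding, while the FFT path avoids the nested loop over all position pairs.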

Abstract

Vision State Space Models (SSMs) like Vim, VMamba, and SiMBA rely on complex scanning strategies to adapt sequential SSMs to process 2D images, introducing computational overhead and architectural complexity. We propose HAMSA, a scanning-free SSM operating directly in the spectral domain. HAMSA introduces three key innovations: (1) simplified kernel parameterization: a single Gaussian-initialized complex kernel replacing traditional (A, B, C) matrices, eliminating discretization instabilities; (2) SpectralPulseNet (SPN): an input-dependent frequency gating mechanism enabling adaptive spectral modulation; and (3) Spectral Adaptive Gating Unit (SAGU): magnitude-based gating for stable gradient flow in the frequency domain. By leveraging FFT-based convolution, HAMSA eliminates sequential scanning while achieving O(L log L) complexity with superior simplicity and efficiency. On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy (state-of-the-art among SSMs), with 2.2× faster inference than transformers (4.2 ms vs 9.2 ms for DeiT-S) and a 1.4-1.9× speedup over scanning-based SSMs, while using less memory (2.1 GB vs 3.2-4.5 GB) and energy (12.5 J vs 18-25 J). HAMSA demonstrates strong generalization across transfer learning and dense prediction tasks.
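To make the scanning-free idea concrete, here is a minimal NumPy sketch of a spectral block on a 2D feature map: a 2D FFT, an elementwise product with a Gaussian-initialized complex kernel, a magnitude-based gate, and an inverse FFT. The abstract does not give the exact definitions of SPN or SAGU, so everything here is an assumption made for illustration: the function names, the `sigma` value, the 8×8 input, and the choice of `sigmoid(|z|)` as the gate are hypothetical and are not the paper's formulation.

```python
import numpy as np

def gaussian_complex_kernel(h, w, sigma=0.02, seed=0):
    """Hypothetical init: real and imaginary parts drawn from N(0, sigma^2).
    Shaped for rfft2 output, i.e. (h, w // 2 + 1)."""
    rng = np.random.default_rng(seed)
    shape = (h, w // 2 + 1)
    return sigma * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))

def magnitude_gate(z):
    """Assumed gate: scale each coefficient by sigmoid(|z|).
    The scale is a positive real, so the phase of z is unchanged."""
    return z / (1.0 + np.exp(-np.abs(z)))

def spectral_block(x, kernel):
    """Process a 2D map without flattening it into a scan order."""
    X = np.fft.rfft2(x)                     # spatial -> frequency domain
    Z = X * kernel                          # global convolution as a product
    Y = magnitude_gate(Z)                   # input-dependent, magnitude-based
    return np.fft.irfft2(Y, s=x.shape)      # back to a real spatial map

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))             # hypothetical single-channel map
kernel = gaussian_complex_kernel(8, 8)
y = spectral_block(x, kernel)
print(y.shape, y.dtype)
```

Because the image is transformed as a whole with a 2D FFT, no row-major, bidirectional, or cross-scan ordering has to be chosen, which is the property the abstract contrasts with scanning-based SSMs.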