SF-Mamba: Rethinking State Space Model for Vision

arXiv cs.CV / 3/18/2026

Key Points

SF-Mamba presents a vision-focused Mamba with two main innovations: auxiliary patch swapping to enable bidirectional information flow under a unidirectional scan, and batch folding with periodic state resets to boost GPU parallelism.
The approach is designed to deliver higher throughput and efficiency, outperforming state-of-the-art baselines across image classification, object detection, and instance/semantic segmentation at multiple model sizes.
It addresses limitations of prior Mamba variants and ViTs by enabling more efficient interaction among patches without relying on quadratic complexity or heavy data rearrangements.
The authors plan to release the source code after publication.
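The batch folding idea in the first key point can be illustrated with a toy scalar recurrence. The sketch below is an assumption-laden illustration, not the authors' implementation: the names `scan` and `folded_scan` are hypothetical, the recurrence `h_t = a_t * h_{t-1} + bx_t` is a simplified stand-in for a selective state space update, and the reset is modeled by zeroing the decay at each sequence boundary. It shows only that scanning several short sequences as one long sequence with periodic resets gives the same result as scanning them separately; it does not model the GPU kernel.

```python
def scan(a, bx, h0=0.0):
    """Sequential linear recurrence h_t = a_t * h_{t-1} + bx_t; returns every h_t."""
    out = []
    h = h0
    for a_t, bx_t in zip(a, bx):
        h = a_t * h + bx_t
        out.append(h)
    return out


def folded_scan(a_batch, bx_batch):
    """Scan a batch of equal-length sequences as ONE long sequence.

    Assumption for this sketch: the state reset is implemented by setting the
    decay to zero at the first token of each sequence, which discards whatever
    state was carried over from the previous sequence.
    """
    seq_len = len(a_batch[0])
    # Fold: concatenate the batch along the token axis.
    a_flat = [v for seq in a_batch for v in seq]
    bx_flat = [v for seq in bx_batch for v in seq]
    # Periodic reset at every sequence boundary.
    for start in range(0, len(a_flat), seq_len):
        a_flat[start] = 0.0
    h_flat = scan(a_flat, bx_flat)
    # Unfold back into per-sequence outputs.
    return [h_flat[i:i + seq_len] for i in range(0, len(h_flat), seq_len)]


# Two short "images" of three tokens each (illustrative values).
a_batch = [[0.9, 0.8, 0.7], [0.5, 0.6, 0.4]]
bx_batch = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

separate = [scan(a, bx) for a, bx in zip(a_batch, bx_batch)]
folded = folded_scan(a_batch, bx_batch)
print(separate == folded)
```

In this toy setting the folded scan processes one sequence of length 6 instead of two of length 3. Longer effective sequences are the setting in which the abstract says Mamba's scan is comparatively faster.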

Abstract

Mamba-based vision models have advanced in recent years as an alternative to Vision Transformers (ViTs), which suffer from quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba is relatively slow at the short token lengths commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping, which encodes bidirectional information flow under a unidirectional scan, and batch folding with periodic state reset, which improves GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes. We will release the source code after publication.
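The causality limitation described in the abstract can be made concrete with a small sketch. Everything here is an illustrative assumption: `causal_scan` is a toy scalar recurrence with a fixed decay, and `bidirectional_scan` is the simplest possible multi-scan (a forward pass plus a backward pass), not the scan design of any particular prior work and not SF-Mamba's auxiliary patch swapping.

```python
def causal_scan(x, decay=0.5):
    """Unidirectional scan: output t depends only on inputs 0..t."""
    out, h = [], 0.0
    for x_t in x:
        h = decay * h + x_t
        out.append(h)
    return out


def bidirectional_scan(x, decay=0.5):
    """Toy multi-scan: sum of a forward scan and a backward scan.

    The backward pass needs the sequence reversed and its output reversed
    again, which is the kind of data rearrangement the abstract refers to.
    """
    forward = causal_scan(x, decay)
    backward = causal_scan(x[::-1], decay)[::-1]
    return [f + b for f, b in zip(forward, backward)]


patches = [1.0, 2.0, 3.0, 4.0]
perturbed = [1.0, 2.0, 3.0, 40.0]  # only the LAST patch changes

y_a = causal_scan(patches)
y_b = causal_scan(perturbed)
# Earlier outputs cannot see the change in the last patch.
print(y_a[:3] == y_b[:3])

z_a = bidirectional_scan(patches)
z_b = bidirectional_scan(perturbed)
# With the extra backward scan, the first output does see it.
print(z_a[0] != z_b[0])
```

The second scan restores non-causal interaction, but at the cost of extra passes and reversals, which is the inefficiency that motivates encoding bidirectional flow within a single unidirectional scan.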

Related Articles

Math needs thinking time, everyday knowledge needs memory, and a new Transformer architecture aims to deliver both

THE DECODER

Kreuzberg v4.5.0: We loved Docling's model so much that we gave it a faster engine

Reddit r/LocalLLaMA

Today, what hardware to get for running large-ish local models like qwen 120b ?

Reddit r/LocalLLaMA

Running mistral locally for meeting notes and it's honestly good enough for my use case

Reddit r/LocalLLaMA

[D] Single-artist longitudinal fine art dataset spanning 5 decades now on Hugging Face — potential applications in style evolution, figure representation, and ethical training data

Reddit r/MachineLearning
