Abstract
We introduce Gated-SwinRMT, a family of hybrid vision transformers that combines the shifted-window attention of the Swin Transformer with the Manhattan-distance spatial decay of Retentive Networks (RMT), augmented by input-dependent gating. Self-attention is decomposed into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks provide a two-dimensional locality prior without learned positional biases.
Two variants are proposed. \textbf{Gated-SwinRMT-SWAT} replaces softmax with a sigmoid activation, implements balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU; the normalized output implicitly suppresses uninformative attention scores. \textbf{Gated-SwinRMT-Retention} retains softmax-normalized retention with an additive log-space decay bias and incorporates an explicit G1 sigmoid gate -- projected from the block input and applied after local context enhancement (LCE) but prior to the output projection~$W_O$ -- to alleviate the low-rank $W_V W_O$ bottleneck and enable input-dependent suppression of attended outputs.
We assess both variants on Mini-ImageNet ($224{\times}224$, 100 classes) and CIFAR-10 ($32{\times}32$, 10 classes) under identical training protocols, training on a single GPU owing to resource constraints. At ${\approx}77$--$79$\,M parameters, Gated-SwinRMT-SWAT achieves 80.22\% and Gated-SwinRMT-Retention 78.20\% top-1 test accuracy on Mini-ImageNet, compared with 73.74\% for the RMT baseline. On CIFAR-10 -- where small feature maps cause the adaptive windowing mechanism to collapse attention to global scope -- the accuracy advantage narrows from +6.48\,pp to +0.56\,pp.