Optimal Decay Spectra for Linear Recurrences

Abstract

Linear recurrent models offer linear-time sequence processing but often suffer from suboptimal long-range memory. We trace this to the decay spectrum: for $N$ channels, random initialization collapses the minimum spectral gap to $O(N^{-2})$, yielding sub-exponential error $\exp(-\Omega(N/\log N))$; linear spacing avoids collapse but degrades to $\exp(-O(N/\sqrt{T}))$, practically algebraic over long contexts. We introduce Position-Adaptive Spectral Tapering (PoST), an architecture-agnostic framework combining two mechanisms: (1) Spectral Reparameterization, which structurally enforces geometrically spaced log-decay rates, proven minimax optimal at rate $O(\exp(-cN/\log T))$; and (2) Position-Adaptive Scaling, the provably unique mechanism that eliminates the scale mismatch of static spectra (where only $N\log t/\log T$ of the $N$ channels are effective at position $t$) by stretching the spectrum to the actual dependency range, sharpening the rate to $O(\exp(-cN/\log t))$. This scaling natively induces fractional invariance: the impulse response becomes scale-free, with channels interpolating between relative and absolute temporal coordinates. PoST integrates into any diagonal linear recurrence without overhead. We instantiate it across Mamba-2, RWKV-7, Gated DeltaNet, Gated Linear Attention, and RetNet. Pre-training at 180M-440M scales shows consistent zero-shot language modeling improvements, significant long-context retrieval gains for Mamba-2 (MQAR and NIAH), and competitive or improved performance across other architectures. Code: https://github.com/SiLifen/PoST.
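The scale mismatch described in the abstract can be illustrated with a small numerical sketch. It is not taken from the PoST repository and rests on several assumptions: timescales are parameterized as tau_k = horizon^(k/(N-1)), a channel counts as "effective" at position t when its timescale does not exceed t, and position-adaptive scaling is modeled simply by rebuilding the geometric spectrum with horizon t in place of T. The function names are hypothetical.

```python
import math

def geometric_timescales(n_channels, horizon):
    # Assumed parameterization: tau_k = horizon**(k / (n - 1)),
    # i.e. timescales geometrically spaced between 1 and `horizon`.
    return [horizon ** (k / (n_channels - 1)) for k in range(n_channels)]

def decay_factors(timescales):
    # Per-step decay a_k = exp(-1 / tau_k) for a diagonal recurrence
    # h_k <- a_k * h_k + x. The log-decay rates 1 / tau_k are then
    # geometrically spaced as well.
    return [math.exp(-1.0 / tau) for tau in timescales]

def effective_channels(timescales, t):
    # Assumed notion of "effective": the channel's timescale fits
    # inside the context seen so far (tau_k <= t).
    return sum(1 for tau in timescales if tau <= t)

N, T, t = 64, 65536, 256

# Static spectrum built for the full training horizon T.
static = geometric_timescales(N, T)
static_count = effective_channels(static, t)
predicted = N * math.log(t) / math.log(T)

# Illustrative position-adaptive spectrum: same geometric shape,
# stretched to the actual position t instead of T.
adaptive = geometric_timescales(N, t)
adaptive_count = effective_channels(adaptive, t)

print("static effective channels:  ", static_count)
print("predicted N log t / log T:  ", predicted)
print("adaptive effective channels:", adaptive_count)
```

With these values the static spectrum leaves half of the channels unused at position t = 256, matching N log t / log T = 32, while the rescaled spectrum uses all 64. The actual PoST scaling rule may differ in its details; the sketch only shows why a spectrum fixed to T wastes channels at small t.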
