Closing the Theory-Practice Gap in Spiking Transformers via Effective Dimension

arXiv cs.LG · April 20, 2026


Key Points

  • The paper introduces the first comprehensive expressivity theory for spiking self-attention, showing that spiking attention with Leaky Integrate-and-Fire neurons can universally approximate continuous permutation-equivariant functions.
  • It provides explicit spike-circuit constructions, including a novel lateral inhibition network that implements softmax normalization with provable convergence of order O(1/√T).
  • Using rate-distortion theory, the authors derive tight lower bounds on required spike counts for ε-approximation, showing a dependence of Ω(L_f² n d / ε²) on task and approximation parameters.
  • The key insight is that the required number of timesteps depends on an input-dependent “effective dimension,” with measured values d_eff = 47–89 on CIFAR/ImageNet explaining why T=4 timesteps suffice in practice despite worst-case predictions of T ≥ 10,000.
  • Experiments across Spikformer, QKFormer, and SpikingResformer for vision and language tasks support the theory, reporting strong fit (R^2=0.97, p<0.001) and calibrated design constants (C=2.3 with 95% CI [1.9, 2.7]).
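The summary does not spell out the paper's lateral-inhibition circuit, but the claimed O(1/√T) rate has a simple Monte Carlo reading: if a pool of neurons competes under divisive inhibition so that one neuron spikes per timestep with probability proportional to its exponentiated drive, the empirical spike rates estimate softmax, and their error shrinks as 1/√T. A minimal NumPy sketch of that idea (the function name and mechanism are illustrative stand-ins, not the authors' construction):

```python
import numpy as np

def spiking_softmax(x, T=1000, seed=0):
    """Estimate softmax(x) from spike rates under divisive lateral inhibition.

    At each timestep every neuron's firing probability is its exponentiated
    drive divided by the pooled (inhibitory) total drive, so exactly one
    winner spikes per step.  The empirical spike rate over T steps is a
    Monte Carlo estimate of softmax(x); its error shrinks as O(1/sqrt(T)).
    """
    rng = np.random.default_rng(seed)
    drive = np.exp(x - np.max(x))              # exponentiated drive, shifted for stability
    p = drive / drive.sum()                    # divisive inhibition normalizes the pool
    winners = rng.choice(len(x), size=T, p=p)  # one spike per step (soft winner-take-all)
    counts = np.bincount(winners, minlength=len(x))
    return counts / T                          # spike rates approximate softmax(x)

x = np.array([1.0, 2.0, 0.5])
exact = np.exp(x - x.max()); exact /= exact.sum()
est = spiking_softmax(x, T=20000)
print(np.abs(est - exact).max())  # error on the order of 1/sqrt(T)
```

Doubling the accuracy therefore costs roughly four times the timesteps, which is why the worst-case timestep predictions discussed below grow so quickly with precision.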

Abstract

Spiking transformers achieve competitive accuracy with conventional transformers while offering 38–57× energy efficiency on neuromorphic hardware, yet no theoretical framework guides their design. This paper establishes the first comprehensive expressivity theory for spiking self-attention. We prove that spiking attention with Leaky Integrate-and-Fire neurons is a universal approximator of continuous permutation-equivariant functions, providing explicit spike-circuit constructions including a novel lateral inhibition network for softmax normalization with proven O(1/√T) convergence. We derive tight spike-count lower bounds via rate-distortion theory: ε-approximation requires Ω(L_f² n d / ε²) spikes, with a rigorous information-theoretic derivation. Our key insight is input-dependent bounds using measured effective dimensions (d_eff = 47–89 for CIFAR/ImageNet), explaining why T=4 timesteps suffice despite worst-case predictions of T ≥ 10,000. We provide concrete design rules with calibrated constants (C=2.3, 95% CI: [1.9, 2.7]). Experiments on Spikformer, QKFormer, and SpikingResformer across vision and language benchmarks validate the predictions with R²=0.97 (p<0.001). Our framework provides the first principled foundation for neuromorphic transformer design.
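Neither the abstract nor the key points define "effective dimension" precisely. A common proxy for such a quantity is the participation ratio of the input covariance spectrum, which matches the ambient dimension for isotropic data and is small when variance concentrates in a few directions; that behavior is consistent with d_eff = 47–89 being far below typical embedding widths. A hedged sketch, with `effective_dimension` as an illustrative stand-in rather than the paper's exact definition:

```python
import numpy as np

def effective_dimension(X):
    """Participation ratio of the covariance spectrum: (sum λ)² / sum λ².

    For isotropic data this approaches the ambient dimension; when variance
    concentrates in a few principal directions it is much smaller.  Used here
    as a plausible stand-in for the paper's d_eff, which may be defined
    differently.
    """
    Xc = X - X.mean(axis=0)                 # center the features
    cov = Xc.T @ Xc / (len(X) - 1)          # sample covariance matrix
    lam = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return lam.sum() ** 2 / (lam ** 2).sum()

rng = np.random.default_rng(0)
# rank-64 data embedded in 512 ambient dimensions: the participation
# ratio is bounded above by the rank, so it stays far below 512
Z = rng.normal(size=(2000, 64)) @ rng.normal(size=(64, 512))
d_eff = effective_dimension(Z)
print(round(d_eff, 1))  # well below the ambient dimension of 512
```

Under this reading, worst-case timestep bounds scale with the ambient dimension d, while realistic inputs only "pay" for d_eff, which is the gap the paper uses to reconcile T ≥ 10,000 worst-case predictions with T=4 working in practice.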