Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention

arXiv cs.CL / 4/3/2026


Key Points

  • The paper introduces Stochastic Attention (SA), a connectome-inspired technique that randomly permutes the token order before applying sliding-window attention, then restores the original order afterward.
  • SA effectively converts a fixed local window into a stochastic global routing mechanism while keeping the same per-layer computational budget of O(nw).
  • By sampling independent permutations across depth, SA yields exponentially expanding receptive fields, reaching full sequence coverage in O(log_w n) layers instead of O(n/w) for standard sliding-window attention.
  • Experiments show SA improves pre-training of language models (with gated SA+SWA performing best for average zero-shot accuracy) and boosts training-free inference on Qwen3-8B and Qwen3-30B-A3B, outperforming SWA and matching/exceeding Mixture of Block Attention under similar compute.
  • The authors argue that stochastic routing inspired by brain connectomics is a practical, drop-in attention primitive that complements existing efficient attention methods (linear/sparse).
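
The permute → windowed-attention → un-permute mechanism described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the windowed op here is a simple causal mean over a window of size `w`, standing in for real sliding-window attention, and the function names are invented for this sketch.

```python
import numpy as np

def sliding_window_mix(x, w):
    # Toy stand-in for causal sliding-window attention: each position
    # aggregates (here: averages) over the previous w positions.
    n = len(x)
    out = np.empty(n, dtype=float)
    for i in range(n):
        lo = max(0, i - w + 1)
        out[i] = x[lo:i + 1].mean()
    return out

def stochastic_attention(x, w, rng):
    # SA wrapper (sketch): sample a random permutation, apply the
    # windowed op in permuted order, then restore the original order.
    # The local window thus mixes a random subset of the sequence,
    # at the same O(n*w) cost as plain sliding-window attention.
    perm = rng.permutation(len(x))
    inv = np.argsort(perm)  # inverse permutation
    return sliding_window_mix(x[perm], w)[inv]
```

Because only the ordering changes, the per-layer compute budget is identical to SWA; with `w = 1` the op degenerates to the identity regardless of the sampled permutation.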

Abstract

The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit level, the network's long-range connections are broadly distributed across brain regions, functioning as stochastic shortcuts that enable efficient global communication. Inspired by this observation, we propose Stochastic Attention (SA), a drop-in enhancement for sliding-window attention (SWA) that applies a random permutation to the token sequence before windowed attention and restores the original order afterward. This transforms the fixed local window into a stochastic global one within the same O(nw) per-layer budget. Through depth, independently sampled permutations yield exponentially growing receptive fields, achieving full sequence coverage in O(log_w n) layers versus O(n/w) for SWA. We validate SA in two settings: pre-training language models from scratch, where a gated SA + SWA combination achieves the best average zero-shot accuracy, and training-free inference on Qwen3-8B and Qwen3-30B-A3B, where SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention at comparable compute budgets. These results suggest that connectome-inspired stochastic routing is a practical primitive for improving the expressivity of efficient attention, complementary to existing linear and sparse approaches.
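
The O(log_w n) versus O(n/w) coverage claim can be illustrated with a small reachability simulation (a toy model written for this summary, not the paper's code): track, for each position, which source tokens can influence it after stacking window-w mixing layers, with and without per-layer random permutations.

```python
import numpy as np

def coverage_after_layers(n, w, layers, permute, seed=0):
    # R[i, j] = True means token j can influence position i.
    # Each layer applies a causal window-w union, optionally in a
    # freshly sampled random order (the SA shortcut mechanism).
    rng = np.random.default_rng(seed)
    R = np.eye(n, dtype=bool)
    for _ in range(layers):
        order = rng.permutation(n) if permute else np.arange(n)
        inv = np.argsort(order)
        Rp = R[order]                         # permute positions
        new = np.zeros_like(Rp)
        for i in range(n):
            lo = max(0, i - w + 1)
            new[i] = Rp[lo:i + 1].any(axis=0) # union over the window
        R = new[inv]                          # restore original order
    return R.sum(axis=1).mean() / n           # mean fractional coverage
```

Without permutations, each layer extends the receptive field by at most w - 1 positions, so coverage grows linearly in depth (the O(n/w) regime); with independent permutations, receptive-field unions compound across layers and coverage saturates after only a logarithmic number of layers.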