ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers

arXiv cs.LG / 4/28/2026


Key Points

  • The paper introduces ELSA, an algorithmic reformulation of online softmax attention that preserves exact softmax semantics while providing a provable O(u log n) FP32 relative error bound, where u is the unit roundoff.
  • ELSA reformulates the online softmax update as a prefix scan over an associative monoid, achieving O(n) extra memory and O(log n) parallel depth, enabling faster long-sequence attention (see the sketch after this list).
  • Unlike FlashAttention-style kernels that depend on specific Tensor Core instructions and lack a compatible FP32 path, ELSA is hardware-agnostic and Tensor-Core independent, with implementations in Triton and CUDA C++.
  • Benchmarks show ELSA delivers 1.3–3.5× speedups on A100 FP32 (1K–16K tokens) versus memory-efficient SDPA and 1.97–2.27× gains on BERT, while also improving performance on Jetson TX2 and under LLaMA-13B offloading at long contexts.
  • The authors claim ELSA is a drop-in replacement that requires no retraining or weight changes, and release code at https://github.com/ming053l/ELSA.
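
The scan formulation is the core of the speedup, so a concrete sketch helps. Below is a minimal NumPy illustration of the associative combine behind online softmax: each state carries a running max m, a normalizer S, and a value accumulator W. The exact state layout and the identity element (-inf, 0, 0) are inferred from standard online-softmax algebra rather than taken from the paper, which only names the monoid (m, S, W); combine is a hypothetical helper, not ELSA's API.

```python
import numpy as np

def combine(a, b):
    """Merge two online-softmax states (m, S, W).

    m: running max of the logits seen so far;
    S: sum of exp(logit - m) (the softmax normalizer);
    W: value accumulator, sum of exp(logit - m) * v.
    The merge is associative, which is what lets it drive a prefix scan.
    """
    m_a, S_a, W_a = a
    m_b, S_b, W_b = b
    m = max(m_a, m_b)
    alpha, beta = np.exp(m_a - m), np.exp(m_b - m)
    return (m, alpha * S_a + beta * S_b, alpha * W_a + beta * W_b)

# One query attending over 16 keys/values, folded left to right.
rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal(d)
K = rng.standard_normal((16, d))
V = rng.standard_normal((16, d))

state = (-np.inf, 0.0, np.zeros(d))          # monoid identity
for k, v in zip(K, V):
    state = combine(state, (q @ k, 1.0, v))  # one leaf state per token
m, S, W = state

# Reference: fully materialized softmax attention.
logits = K @ q
p = np.exp(logits - logits.max())
assert np.allclose(W / S, (p / p.sum()) @ V)
```

Because combine is associative, the left-to-right fold above can be regrouped into any reduction tree without changing the result in real arithmetic, which is what makes the O(log n) parallel depth possible.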

Abstract

Existing attention accelerators often trade away exact softmax semantics, depend on fused Tensor Core kernels, or incur sequential depth that limits FP32 throughput on long sequences. We present ELSA, an algorithmic reformulation of online softmax attention that (i) preserves exact softmax semantics in real arithmetic with a provable O(u log n) FP32 relative error bound; (ii) casts the online softmax update as a prefix scan over an associative monoid (m, S, W), yielding O(n) extra memory and O(log n) parallel depth; and (iii) is Tensor-Core independent, implemented in Triton and CUDA C++, and deployable as a drop-in replacement requiring no retraining or weight modification. Unlike FlashAttention-2/3, which rely on HMMA/GMMA Tensor Core instructions and provide no compatible FP32 path, ELSA operates identically on A100s and on resource-constrained edge devices such as the Jetson TX2, making it the only hardware-agnostic exact-attention kernel that reduces parallel depth to O(log n) at full precision. On A100 FP32 benchmarks (1K–16K tokens), ELSA delivers 1.3–3.5× speedups over memory-efficient SDPA and 1.97–2.27× on BERT; on Jetson TX2, ELSA achieves 1.5–1.6× over the Math SDPA backend (64–900 tokens), with 17.8–20.2% throughput gains under LLaMA-13B offloading at contexts ≥32K tokens. In FP16, ELSA approaches hardware-fused baselines at long sequences while retaining full FP32 capability, offering a unified kernel for high-precision inference across platforms. Our code and implementation are available at https://github.com/ming053l/ELSA.
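
To make the O(log n) depth claim concrete, the sketch below (same assumed state layout as above) reduces per-token leaf states pairwise instead of sequentially and checks that the result still matches materialized softmax attention. In a real kernel each level's pairs would run in parallel, and producing running outputs for every prefix would require a full parallel scan rather than this single reduction; tree_reduce is an illustrative helper, not the paper's implementation.

```python
import numpy as np

def combine(a, b):
    """Associative merge of online-softmax states (m, S, W); see the sketch above."""
    m = max(a[0], b[0])
    ea, eb = np.exp(a[0] - m), np.exp(b[0] - m)
    return (m, ea * a[1] + eb * b[1], ea * a[2] + eb * b[2])

def tree_reduce(states):
    """Pairwise reduction tree: O(log n) levels because combine is associative."""
    while len(states) > 1:
        nxt = [combine(states[i], states[i + 1])
               for i in range(0, len(states) - 1, 2)]
        if len(states) % 2:  # an odd leftover carries over to the next level
            nxt.append(states[-1])
        states = nxt
    return states[0]

rng = np.random.default_rng(1)
n, d = 1000, 8
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# One leaf state per token, reduced in ~log2(n) levels.
m, S, W = tree_reduce([(q @ k, 1.0, v) for k, v in zip(K, V)])

logits = K @ q
p = np.exp(logits - logits.max())
assert np.allclose(W / S, (p / p.sum()) @ V)
```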