ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers

arXiv cs.LG / 4/28/2026


Key Points

  • The paper introduces ELSA, an algorithmic reformulation of online softmax attention that preserves exact softmax semantics while providing a provable O(u log n) FP32 relative error bound, where u is the unit roundoff.
  • ELSA reformulates the online softmax update as a prefix scan over an associative monoid, achieving O(n) extra memory and O(log n) parallel depth, enabling faster long-sequence attention (see the sketch after this list).
  • Unlike FlashAttention-style kernels that depend on specific Tensor Core instructions and lack a compatible FP32 path, ELSA is hardware-agnostic and Tensor-Core independent, with implementations in Triton and CUDA C++.
  • Benchmarks show ELSA delivers 1.3–3.5× speedups on A100 FP32 (1K–16K tokens) versus memory-efficient SDPA and 1.97–2.27× gains on BERT, while also improving performance on Jetson TX2 and under LLaMA-13B offloading at long contexts.
  • The authors claim ELSA is a drop-in replacement that requires no retraining or weight changes, and release code at https://github.com/ming053l/ELSA.
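
The scan formulation is the core of the speedup, so a concrete sketch helps. Below is a minimal NumPy illustration of the associative combine behind online softmax: each state carries a running max m, a normalizer S, and a value accumulator W. The exact state layout and the identity element (-inf, 0, 0) are inferred from standard online-softmax algebra rather than taken from the paper, which only names the monoid (m, S, W); combine is a hypothetical helper, not ELSA's API.

```python
import numpy as np

def combine(a, b):
    """Merge two online-softmax states (m, S, W).

    m: running max of the logits seen so far;
    S: sum of exp(logit - m) (the softmax normalizer);
    W: value accumulator, sum of exp(logit - m) * v.
    The merge is associative, which is what lets it drive a prefix scan.
    """
    m_a, S_a, W_a = a
    m_b, S_b, W_b = b
    m = max(m_a, m_b)
    alpha, beta = np.exp(m_a - m), np.exp(m_b - m)
    return (m, alpha * S_a + beta * S_b, alpha * W_a + beta * W_b)

# One query attending over 16 keys/values, folded left to right.
rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal(d)
K = rng.standard_normal((16, d))
V = rng.standard_normal((16, d))

state = (-np.inf, 0.0, np.zeros(d))          # monoid identity
for k, v in zip(K, V):
    state = combine(state, (q @ k, 1.0, v))  # one leaf state per token
m, S, W = state

# Reference: fully materialized softmax attention.
logits = K @ q
p = np.exp(logits - logits.max())
assert np.allclose(W / S, (p / p.sum()) @ V)
```

Because combine is associative, the left-to-right fold above can be regrouped into any reduction tree without changing the result in real arithmetic, which is what makes the O(log n) parallel depth possible.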

Abstract

Existing attention accelerators often trade away exact softmax semantics, depend on fused Tensor Core kernels, or incur sequential depth that limits FP32 throughput on long sequences. We present ELSA, an algorithmic reformulation of online softmax attention that (i) preserves exact softmax semantics in real arithmetic with a provable O(u log n) FP32 relative error bound; (ii) casts the online softmax update as a prefix scan over an associative monoid (m, S, W), yielding O(n) extra memory and O(log n) parallel depth; and (iii) is Tensor-Core independent, implemented in Triton and CUDA C++, and deployable as a drop-in replacement requiring no retraining or weight modification. Unlike FlashAttention-2/3, which rely on HMMA/GMMA Tensor Core instructions and provide no compatible FP32 path, ELSA operates identically on A100s and on resource-constrained edge devices such as the Jetson TX2, making it the only hardware-agnostic exact-attention kernel that reduces parallel depth to O(log n) at full precision. On A100 FP32 benchmarks (1K–16K tokens), ELSA delivers 1.3–3.5× speedups over memory-efficient SDPA and 1.97–2.27× on BERT; on Jetson TX2, ELSA achieves 1.5–1.6× over the Math SDPA backend (64–900 tokens), with 17.8–20.2% throughput gains under LLaMA-13B offloading at contexts ≥32K tokens. In FP16, ELSA approaches hardware-fused baselines at long sequences while retaining full FP32 capability, offering a unified kernel for high-precision inference across platforms. Our code and implementation are available at https://github.com/ming053l/ELSA.
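
To make the O(log n) depth claim concrete, the sketch below (same assumed state layout as above) reduces per-token leaf states pairwise instead of sequentially and checks that the result still matches materialized softmax attention. In a real kernel each level's pairs would run in parallel, and producing running outputs for every prefix would require a full parallel scan rather than this single reduction; tree_reduce is an illustrative helper, not the paper's implementation.

```python
import numpy as np

def combine(a, b):
    """Associative merge of online-softmax states (m, S, W); see the sketch above."""
    m = max(a[0], b[0])
    ea, eb = np.exp(a[0] - m), np.exp(b[0] - m)
    return (m, ea * a[1] + eb * b[1], ea * a[2] + eb * b[2])

def tree_reduce(states):
    """Pairwise reduction tree: O(log n) levels because combine is associative."""
    while len(states) > 1:
        nxt = [combine(states[i], states[i + 1])
               for i in range(0, len(states) - 1, 2)]
        if len(states) % 2:  # an odd leftover carries over to the next level
            nxt.append(states[-1])
        states = nxt
    return states[0]

rng = np.random.default_rng(1)
n, d = 1000, 8
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# One leaf state per token, reduced in ~log2(n) levels.
m, S, W = tree_reduce([(q @ k, 1.0, v) for k, v in zip(K, V)])

logits = K @ q
p = np.exp(logits - logits.max())
assert np.allclose(W / S, (p / p.sum()) @ V)
```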