Scaling Attention via Feature Sparsity

arXiv cs.AI / 3/25/2026


Key Points

  • The paper introduces Sparse Feature Attention (SFA), which reduces transformer self-attention cost by representing queries and keys as k-sparse codes — sparsifying along the feature axis rather than along the sequence axis targeted by most existing methods.
  • It estimates that SFA can cut attention complexity from Θ(n^2 d) to Θ(n^2 k^2/d) while aiming to preserve the expressivity needed for accuracy.
  • To run efficiently at scale, the authors propose FlashSFA, an IO-aware kernel extending FlashAttention to compute attention directly on sparse overlaps without building dense score matrices.
  • Experiments on GPT-2 and Qwen3 pretraining report up to 2.5× speedups and nearly 50% reductions in FLOPs and KV-cache usage, with maintained or improved long-context retrieval performance.
  • Benchmarks suggest SFA preserves robustness in long contexts and outperforms short-embedding baselines, positioning feature-level sparsity as a complementary approach for longer-context scaling with minimal quality loss.
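The mechanism in the key points above can be illustrated with a minimal NumPy sketch. This is a naive reference implementation, not the paper's encoder or kernel: the `k_sparse` magnitude-based truncation and the helper names are assumptions for illustration. The point it demonstrates is that with k-sparse rows, each query-key dot product only touches the overlap of their nonzero coordinates, yet the result equals the dense score exactly.

```python
import numpy as np

def k_sparse(x, k):
    """Keep the k largest-magnitude entries of each row, zeroing the rest.
    A hypothetical sparsification step; the paper's actual code construction
    may differ."""
    drop = np.argsort(np.abs(x), axis=-1)[:, :-k]  # indices of the d-k smallest entries
    out = x.copy()
    np.put_along_axis(out, drop, 0.0, axis=-1)
    return out

def sparse_scores(Q, K):
    """Attention logits computed only over overlapping nonzero coordinates.
    Each dot product touches at most k shared indices, which is where the
    ~k^2/d expected per-pair cost comes from."""
    n = Q.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        qi = np.flatnonzero(Q[i])
        for j in range(n):
            common = np.intersect1d(qi, np.flatnonzero(K[j]), assume_unique=True)
            S[i, j] = Q[i, common] @ K[j, common]
    return S

rng = np.random.default_rng(0)
n, d, k = 4, 16, 4
Q = k_sparse(rng.standard_normal((n, d)), k)
K = k_sparse(rng.standard_normal((n, d)), k)
# Sparse-overlap scores agree with the dense matmul on the sparsified codes.
assert np.allclose(sparse_scores(Q, K), Q @ K.T)
```

FlashSFA's contribution, per the paper, is doing this overlap computation in an IO-aware fused kernel; the Python loops here exist only to make the arithmetic visible.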

Abstract

Scaling Transformers to ultra-long contexts is bottlenecked by the O(n^2 d) cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: feature sparsity. We propose Sparse Feature Attention (SFA), where queries and keys are represented as k-sparse codes that preserve high-dimensional expressivity while reducing the cost of attention from Θ(n^2 d) to Θ(n^2 k^2/d). To make this efficient at scale, we introduce FlashSFA, an IO-aware kernel that extends FlashAttention to operate directly on sparse overlaps without materializing dense score matrices. Across GPT-2 and Qwen3 pretraining, SFA matches dense baselines while improving speed by up to 2.5× and reducing FLOPs and KV-cache by nearly 50%. On synthetic and downstream benchmarks, SFA preserves retrieval accuracy and robustness at long contexts, outperforming short-embedding baselines that collapse feature diversity. These results establish feature-level sparsity as a complementary and underexplored axis for efficient attention, enabling Transformers to scale to orders-of-magnitude longer contexts with minimal quality loss. Code is available at https://github.com/YannX1e/Sparse-Feature-Attention.
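The k^2/d term in the abstract's Θ(n^2 k^2/d) claim has a simple probabilistic reading: if the supports of two independent k-sparse codes behave like uniform random k-subsets of {0, …, d-1}, each of one code's k indices lands in the other's support with probability k/d, so the expected overlap is k^2/d. The support model is our assumption for illustration, not a claim from the paper. A quick Monte Carlo check:

```python
import numpy as np

# Expected overlap of two independent uniform k-subsets of {0,...,d-1}
# is k^2/d: each of A's k indices hits B's support with probability k/d.
rng = np.random.default_rng(1)
d, k, trials = 256, 16, 20000
overlaps = [
    len(np.intersect1d(rng.choice(d, k, replace=False),
                       rng.choice(d, k, replace=False)))
    for _ in range(trials)
]
mean_overlap = float(np.mean(overlaps))
print(mean_overlap, k**2 / d)  # empirical mean vs. the k^2/d prediction
```

With d = 256 and k = 16 the prediction is k^2/d = 1: on average only one coordinate per query-key pair contributes, versus d = 256 multiply-adds for a dense dot product.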