Scaling Attention via Feature Sparsity

arXiv cs.AI / 3/25/2026


Key Points

  • The paper introduces Sparse Feature Attention (SFA), which reduces transformer self-attention cost by representing queries and keys as k-sparse codes — sparsifying along the feature axis rather than along the sequence axis targeted by most existing methods.
  • It estimates that SFA can cut attention complexity from Θ(n^2 d) to Θ(n^2 k^2/d) while aiming to preserve the expressivity needed for accuracy.
  • To run efficiently at scale, the authors propose FlashSFA, an IO-aware kernel extending FlashAttention to compute attention directly on sparse overlaps without building dense score matrices.
  • Experiments on GPT-2 and Qwen3 pretraining report up to 2.5× speedups and nearly 50% reductions in FLOPs and KV-cache usage, with maintained or improved long-context retrieval performance.
  • Benchmarks suggest SFA preserves robustness in long contexts and outperforms short-embedding baselines, positioning feature-level sparsity as a complementary approach for longer-context scaling with minimal quality loss.
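The mechanism in the key points above can be illustrated with a minimal NumPy sketch. This is a naive reference implementation, not the paper's encoder or kernel: the `k_sparse` magnitude-based truncation and the helper names are assumptions for illustration. The point it demonstrates is that with k-sparse rows, each query-key dot product only touches the overlap of their nonzero coordinates, yet the result equals the dense score exactly.

```python
import numpy as np

def k_sparse(x, k):
    """Keep the k largest-magnitude entries of each row, zeroing the rest.
    A hypothetical sparsification step; the paper's actual code construction
    may differ."""
    drop = np.argsort(np.abs(x), axis=-1)[:, :-k]  # indices of the d-k smallest entries
    out = x.copy()
    np.put_along_axis(out, drop, 0.0, axis=-1)
    return out

def sparse_scores(Q, K):
    """Attention logits computed only over overlapping nonzero coordinates.
    Each dot product touches at most k shared indices, which is where the
    ~k^2/d expected per-pair cost comes from."""
    n = Q.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        qi = np.flatnonzero(Q[i])
        for j in range(n):
            common = np.intersect1d(qi, np.flatnonzero(K[j]), assume_unique=True)
            S[i, j] = Q[i, common] @ K[j, common]
    return S

rng = np.random.default_rng(0)
n, d, k = 4, 16, 4
Q = k_sparse(rng.standard_normal((n, d)), k)
K = k_sparse(rng.standard_normal((n, d)), k)
# Sparse-overlap scores agree with the dense matmul on the sparsified codes.
assert np.allclose(sparse_scores(Q, K), Q @ K.T)
```

FlashSFA's contribution, per the paper, is doing this overlap computation in an IO-aware fused kernel; the Python loops here exist only to make the arithmetic visible.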

Abstract

Scaling Transformers to ultra-long contexts is bottlenecked by the O(n^2 d) cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: feature sparsity. We propose Sparse Feature Attention (SFA), where queries and keys are represented as k-sparse codes that preserve high-dimensional expressivity while reducing the cost of attention from Θ(n^2 d) to Θ(n^2 k^2/d). To make this efficient at scale, we introduce FlashSFA, an IO-aware kernel that extends FlashAttention to operate directly on sparse overlaps without materializing dense score matrices. Across GPT-2 and Qwen3 pretraining, SFA matches dense baselines while improving speed by up to 2.5× and reducing FLOPs and KV-cache by nearly 50%. On synthetic and downstream benchmarks, SFA preserves retrieval accuracy and robustness at long contexts, outperforming short-embedding baselines that collapse feature diversity. These results establish feature-level sparsity as a complementary and underexplored axis for efficient attention, enabling Transformers to scale to orders-of-magnitude longer contexts with minimal quality loss. Code is available at https://github.com/YannX1e/Sparse-Feature-Attention.
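The k^2/d term in the abstract's Θ(n^2 k^2/d) claim has a simple probabilistic reading: if the supports of two independent k-sparse codes behave like uniform random k-subsets of {0, …, d-1}, each of one code's k indices lands in the other's support with probability k/d, so the expected overlap is k^2/d. The support model is our assumption for illustration, not a claim from the paper. A quick Monte Carlo check:

```python
import numpy as np

# Expected overlap of two independent uniform k-subsets of {0,...,d-1}
# is k^2/d: each of A's k indices hits B's support with probability k/d.
rng = np.random.default_rng(1)
d, k, trials = 256, 16, 20000
overlaps = [
    len(np.intersect1d(rng.choice(d, k, replace=False),
                       rng.choice(d, k, replace=False)))
    for _ in range(trials)
]
mean_overlap = float(np.mean(overlaps))
print(mean_overlap, k**2 / d)  # empirical mean vs. the k^2/d prediction
```

With d = 256 and k = 16 the prediction is k^2/d = 1: on average only one coordinate per query-key pair contributes, versus d = 256 multiply-adds for a dense dot product.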