VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation

arXiv cs.LG / 4/15/2026

Key Points

  • FlashAttention-style online softmax is memory-efficient, but its non-matmul reduction/update steps (rowmax/rowsum and rescaling chains) can become vector/SIMD bottlenecks and dominate latency at high accelerator throughput.
  • The paper proposes Vector Relieved Flash Attention (VFA), which lowers the cost of running-maximum updates by initializing the max with a key-block approximation, reordering block traversal to stabilize the max early, and freezing the maximum for later blocks while keeping the online-softmax structure.
  • It extends this idea to Vector Relieved Sparse Attention (VSA) by integrating with block-sparse skipping (e.g., BLASST) to reduce both the number of blocks and per-block overhead.
  • VFA/VSA avoid the conditional rescale operation used in FA4.0’s update stage, and evaluations on benchmarks like MMLU and MATH500 report speedups over the C16V32 baseline with no loss in accuracy.
  • Results show that smaller variants (e.g., C8V32, C4V32, C4V16) reach about 2× speedup on modern hardware, with the expectation of up to ~6× improvement on future architectures via better exponent capacity.
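For context, the per-block vector work the paper targets is easiest to see in a minimal NumPy sketch of standard online softmax for a single query row. This is a generic illustration of the FlashAttention-style recurrence, not the authors' kernel:

```python
import numpy as np

def online_softmax_attention(q, k, v, block=4):
    """Streaming (FlashAttention-style) exact attention for one query row.

    Each key block triggers a rowmax reduction, an exp-rescale of the
    running accumulator, and a rowsum update -- the non-matmul vector
    operations that VFA aims to relieve.
    """
    d = q.shape[-1]
    m = -np.inf           # running maximum
    l = 0.0               # running normalizer (rowsum)
    acc = np.zeros(d)     # unnormalized output accumulator
    for s in range(0, k.shape[0], block):
        kb, vb = k[s:s + block], v[s:s + block]
        scores = kb @ q / np.sqrt(d)       # matmul part (tensor cores)
        m_new = max(m, scores.max())       # rowmax reduction (vector op)
        scale = np.exp(m - m_new)          # rescale chain   (vector op)
        p = np.exp(scores - m_new)
        l = l * scale + p.sum()            # rowsum update   (vector op)
        acc = acc * scale + p @ vb
        m = m_new
    return acc / l

# Sanity check against exact (materialized) softmax attention:
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(8,)), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
scores = k @ q / np.sqrt(8)
p = np.exp(scores - scores.max())
ref = (p / p.sum()) @ v
out = online_softmax_attention(q, k, v)
assert np.allclose(out, ref)
```

Note that the rowmax, rescale, and rowsum lines execute once per key block, so as matmul throughput grows they become the latency floor the paper describes.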

Abstract

FlashAttention-style online softmax enables exact attention computation with linear memory by streaming score tiles through on-chip memory and maintaining a running maximum and normalizer. However, as attention kernels approach peak tensor-core/cube-core throughput on modern accelerators, non-matmul components of online softmax -- especially per-tile rowmax and rowsum reductions and rescale chains -- can become vector or SIMD limited and dominate latency. This paper revisits FlashAttention and proposes Vector Relieved Flash Attention (VFA), a hardware-friendly method that reduces rowmax-driven updates of the running maximum while retaining the online-softmax structure. VFA initializes the running maximum via a cheap approximation from key-block representations, reorders key-block traversal to prioritize high-impact sink and local blocks, and freezes the maximum for remaining blocks to avoid repeated reductions and rescaling. We further integrate VFA with block-sparse skipping methods such as BLASST to form Vector Relieved Sparse Attention (VSA), which reduces both block count and per-block overhead. Notably, VFA and VSA completely avoid the conditional rescale operation in the update stage used in FA4.0. Extensive evaluations on benchmarks including MMLU and MATH500, together with attention statistics, verify our design: (i) sink and local reordering stabilizes the running maximum early; (ii) simple Q and K block summaries fail due to intra-block heterogeneity; (iii) m-initialization is required when maxima appear in middle blocks. Overall, VFA and VSA efficiently alleviate online-softmax reduction bottlenecks without performance loss. Compared to the C16V32 baseline, the C8V32, C4V32, and C4V16 variants achieve nearly 2× speedup on modern hardware, where the vector units are the bottleneck. With upcoming architectural improvements to exponent capacity, C4V16 is projected to deliver up to a 6× speedup.
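The frozen-maximum idea from the abstract can be sketched as follows. This is a simplified NumPy illustration under the assumption that an approximate maximum `m_init` (in the paper, derived from key-block summaries plus the sink/local blocks traversed first) is already available; the reordering and initialization machinery itself is omitted:

```python
import numpy as np

def frozen_max_attention(q, k, v, m_init, block=4):
    """Online softmax for one query row with the maximum frozen at m_init.

    With the max fixed, each key block needs no rowmax reduction and no
    rescaling of the accumulator: the per-block vector work collapses to
    an exp and a rowsum. The result is mathematically exact for any
    finite m_init (the scale cancels in acc / l); m_init only needs to
    track the true row maximum closely enough that exp stays in the
    representable floating-point range.
    """
    d = q.shape[-1]
    l, acc = 0.0, np.zeros(d)
    for s in range(0, k.shape[0], block):
        kb, vb = k[s:s + block], v[s:s + block]
        p = np.exp(kb @ q / np.sqrt(d) - m_init)  # no rowmax, no rescale
        l += p.sum()
        acc += p @ vb
    return acc / l

# With m_init at (or near) the true row max, outputs match exact softmax:
rng = np.random.default_rng(1)
q, k, v = rng.normal(size=(8,)), rng.normal(size=(24, 8)), rng.normal(size=(24, 8))
scores = k @ q / np.sqrt(8)
p = np.exp(scores - scores.max())
ref = (p / p.sum()) @ v
out = frozen_max_attention(q, k, v, m_init=scores.max())
assert np.allclose(out, ref)
```

This also makes the abstract's point about exponent capacity concrete: a slightly loose `m_init` shifts all exponents by a constant, so the viability of freezing depends on how much dynamic range the hardware's exponent format can absorb.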