VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation
arXiv cs.LG / 4/15/2026
Key Points
- FlashAttention-style online softmax is memory-efficient, but its non-matmul reduction/update steps (rowmax/rowsum and rescaling chains) can become vector/SIMD bottlenecks and dominate latency at high accelerator throughput.
- The paper proposes Vector Relieved Flash Attention (VFA), which lowers the cost of running-maximum updates by initializing the max with a key-block approximation, reordering block traversal to stabilize the max early, and freezing the maximum for later blocks while keeping the online-softmax structure.
- It extends this idea to Vector Relieved Sparse Attention (VSA) by integrating with block-sparse skipping (e.g., BLASST) to reduce both the number of blocks and per-block overhead.
- VFA/VSA avoid the conditional rescale operation used in FA4.0's update stage, and evaluations on benchmarks such as MMLU and MATH500 report speedups over C16V32 baselines without accuracy degradation.
- Results show that smaller variants (e.g., C8V32, C4V32, C4V16) reach about 2× speedup on current hardware, with a projected improvement of up to ~6× on future architectures via increased exponent capacity.
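The core contrast in the key points above can be sketched in a few lines. Below is a minimal, hedged NumPy illustration (not the paper's implementation): the first function is standard FlashAttention-style online softmax, where every key block may raise the running maximum and force a rescale of the accumulator (the vector-unit work the paper targets); the second freezes the maximum up front, eliminating the per-block max update and rescale chain. The frozen value `m_hat` here is a stand-in for the paper's key-block approximation, which this summary does not specify in detail.

```python
import numpy as np

def online_softmax_attention(q, K, V, block=4):
    """FlashAttention-style online softmax over key blocks.
    The running max m is updated at every block, forcing a rescale
    of the accumulator and denominator (vector/SIMD work)."""
    m = -np.inf                      # running row max
    l = 0.0                          # running softmax denominator
    acc = np.zeros(V.shape[1])       # running weighted sum of values
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q       # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)            # rescale previous state
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l

def frozen_max_attention(q, K, V, block=4, m_hat=None):
    """Hedged sketch of the frozen-max idea: fix the maximum once
    (here using the exact global max as a placeholder for the
    paper's key-block approximation) so the per-block max update
    and conditional rescale disappear from the inner loop."""
    if m_hat is None:
        m_hat = (K @ q).max()        # placeholder; the paper approximates this
    l = 0.0
    acc = np.zeros(V.shape[1])
    for start in range(0, K.shape[0], block):
        p = np.exp(K[start:start + block] @ q - m_hat)
        l += p.sum()
        acc += p @ V[start:start + block]
    return acc / l
```

With an exact (or sufficiently tight) `m_hat`, both routines compute the same softmax-weighted average of `V`; the frozen-max loop simply trades max-tracking for a good initial estimate, which is where the block-reordering step in the paper comes in.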