FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference.

Reddit r/LocalLLaMA / 3/24/2026


Wrote a deep dive on FlashAttention-4 (03/05/2026) that's relevant for anyone thinking about inference performance.

TL;DR for inference:

  • BF16 forward: 1,613 TFLOPs/s on B200 (71% utilization). Attention is basically at matmul speed now.
  • 2.1-2.7x faster than Triton, up to 1.3x faster than cuDNN 9.13
  • vLLM 0.17.0 (released March 7) integrates FA-4. If you're on B200, it's automatic.
  • PyTorch FlexAttention also has an FA-4 backend (1.2-3.2x over Triton backend)
  • GQA and MQA fully supported (Llama, Mistral, Qwen, Gemma all work)
  • Sliding window available via window_size parameter
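For anyone unsure what the `window_size` parameter actually restricts: FlashAttention's API takes `window_size=(left, right)`, meaning query `i` can only attend to keys in `[i - left, i + right]`. Below is a plain NumPy reference (my own illustration, not FA-4's fused kernel) of those semantics:

```python
import numpy as np

def sliding_window_attention(q, k, v, window_size=(4, 0)):
    """Reference (non-fused) sliding-window attention.

    window_size = (left, right): query i may attend to keys j with
    i - left <= j <= i + right, matching the (left, right) convention
    of FlashAttention's `window_size` argument. right=0 gives causal
    attention with a (left+1)-token window.
    """
    seq_q, d = q.shape
    seq_k = k.shape[0]
    scores = q @ k.T / np.sqrt(d)                 # (seq_q, seq_k)
    i = np.arange(seq_q)[:, None]
    j = np.arange(seq_k)[None, :]
    left, right = window_size
    mask = (j >= i - left) & (j <= i + right)     # in-window positions
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)                            # exp(-inf) -> 0 outside window
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With `window_size=(seq_len - 1, 0)` this reduces to ordinary causal attention, which is why sliding window comes almost for free once the mask is in the kernel.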

Bad news for most of us:

FA-4 is Hopper + Blackwell only. Works on H100/H800 and B200/B100. Not on A100 or consumer cards. The optimizations exploit specific Blackwell hardware features (TMEM, 2-CTA MMA, async TMA) that don't exist on older GPUs.

If you're on A100: stay on FA-2.

If you're on H100: FA-4 is supported but gains are smaller than on Blackwell. Worth testing.

If you're on B200: just update vLLM and you're good.
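If you're not sure which bucket your card falls in, the CUDA compute capability major version tells you (Ampere is sm_80, Hopper sm_90, Blackwell sm_100). A tiny helper encoding the advice above — the function name and return strings are mine:

```python
def fa_version_hint(cc_major: int) -> str:
    """Map CUDA compute capability (major) to which FlashAttention
    generation is worth running, per the advice in this post.
    Get your GPU's value with torch.cuda.get_device_capability().
    """
    if cc_major >= 10:   # Blackwell: B100/B200 (sm_100+)
        return "FA-4: full gains; vLLM 0.17.0+ uses it automatically"
    if cc_major == 9:    # Hopper: H100/H800 (sm_90)
        return "FA-4 supported, smaller gains than Blackwell; benchmark it"
    if cc_major == 8:    # Ampere: A100 and RTX 30xx (sm_80/sm_86)
        return "stay on FA-2"
    return "FA-2 or earlier; FA-4's hardware features are unavailable"
```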

The article breaks down why softmax (not matmul) is now the bottleneck on Blackwell, how selective rescaling cuts the softmax correction work by ~10x, and the full 5-stage pipeline architecture.
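To give a feel for why most of the rescaling work can be skipped: in online (streaming) softmax, the accumulator only needs a correction multiply when a new block raises the running max — otherwise the correction factor is exactly 1. A simplified NumPy sketch of that idea (FA-4's actual policy is more involved; see the article):

```python
import numpy as np

def online_softmax_sum(scores, values):
    """Streaming softmax-weighted sum over blocks, rescaling the
    accumulator only when a block raises the running max. Skipped
    rescales have correction factor exp(0) = 1, so the result is
    exact; in practice the max rarely moves, so most blocks skip.
    """
    m = -np.inf                                # running max
    acc = np.zeros(values.shape[-1])
    denom = 0.0
    rescales = 0
    for s_blk, v_blk in zip(scores, values):   # iterate over K/V blocks
        m_blk = s_blk.max()
        if m_blk > m:                          # max moved: must rescale
            if np.isfinite(m):
                c = np.exp(m - m_blk)
                acc *= c
                denom *= c
                rescales += 1
            m = m_blk
        # else: correction factor is exactly 1 -> skip the rescale
        p = np.exp(s_blk - m)
        acc += p @ v_blk
        denom += p.sum()
    return acc / denom, rescales
```

Running maxes tend to settle after the first few blocks, so `rescales` stays far below the block count — that's the work being saved.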

Also covers the Python angle: FA-4 is 100% CuTe-DSL (NVIDIA's Python kernel DSL). Compiles in 2.5 seconds vs 55 seconds for the C++ equivalent. Same runtime perf. That's a big deal for kernel iteration speed.

Paper: https://arxiv.org/abs/2603.05451

Article free link: https://medium.com/ai-advances/flashattention-4-python-gpu-kernel-blackwell-2b18f51c8b32?sk=59bca93c369143e5f74fb0f86e57e6d0

For those running local models:

The algorithmic ideas (selective rescaling, software-emulated exp) will likely trickle down to consumer GPUs eventually. The CuTe-DSL tooling is the real unlock for faster kernel development across the board.
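On "software-emulated exp": the point is computing the exponential with ordinary integer/FMA instructions instead of queuing on the special-function unit. One classic flavor of this (a Schraudolph-style bit trick, shown here in NumPy purely as an illustration — not FA-4's actual emulation, which the article details) builds an approximate `2^x` by writing the exponent and a linear mantissa directly into the IEEE-754 bits:

```python
import numpy as np

def fast_exp2(x):
    """Approximate 2**x by constructing float32 bits directly:
    floor(x) + 127 lands in the exponent field, and the fractional
    remainder spills into the mantissa, giving a piecewise-linear
    approximation (max relative error ~6%). Only adds/multiplies
    and an int conversion -- no special-function hardware needed.
    """
    x = np.clip(x, -126.0, 126.0)              # keep exponent field valid
    bits = ((x + 127.0) * (1 << 23)).astype(np.int32)
    return bits.view(np.float32)
```

Real kernels sharpen this with a small polynomial on the mantissa to reach the accuracy attention needs, but the structure is the same: exp becomes a handful of FMAs.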

submitted by /u/Sensitive-Two9732