Wrote a deep dive on FlashAttention-4 (03/05/2026) that's relevant for anyone thinking about inference performance.
TL;DR for inference:
- BF16 forward: 1,613 TFLOPs/s on B200 (71% utilization). Attention is basically at matmul speed now.
- 2.1-2.7x faster than Triton and up to 1.3x faster than cuDNN 9.13
- vLLM 0.17.0 (released March 7) integrates FA-4. If you're on B200, it's automatic.
- PyTorch FlexAttention also has an FA-4 backend (1.2-3.2x over Triton backend)
- GQA and MQA fully supported (Llama, Mistral, Qwen, Gemma all work)
- Sliding window available via window_size parameter
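To make the sliding-window semantics concrete, here's a plain-NumPy reference of what a `window_size=(left, right)` band means (the convention flash-attn has used since FA-2). Single head, no batching; this is my own illustration of the math, not the FA-4 kernel:

```python
import numpy as np

def sliding_window_attention(q, k, v, window_size=(256, 0)):
    """Reference semantics of sliding-window attention: query i may only
    attend to keys j with i - left <= j <= i + right. window_size=(left, 0)
    is the usual causal-LLM setup (Mistral-style sliding window).
    Shapes: q, k, v are (seqlen, head_dim); single head for clarity."""
    left, right = window_size
    s, d = q.shape
    scores = q @ k.T / np.sqrt(d)              # (s, s) attention logits
    i = np.arange(s)[:, None]
    j = np.arange(s)[None, :]
    mask = (j >= i - left) & (j <= i + right)  # band of allowed positions
    scores = np.where(mask, scores, -np.inf)   # forbid everything outside it
    scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out = sliding_window_attention(q, k, v, window_size=(2, 0))
```

With `window_size=(seqlen, seqlen)` this degenerates to ordinary full attention, which is a handy sanity check.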
Bad news for most of us:
FA-4 is Hopper + Blackwell only. Works on H100/H800 and B200/B100. Not on A100 or consumer cards. The optimizations exploit hardware features that don't exist on older GPUs: async TMA (Hopper and newer) plus TMEM and 2-CTA MMA (Blackwell only).
If you're on A100: stay on FA-2.
If you're on H100: FA-4 is supported but gains are smaller than on Blackwell. Worth testing.
If you're on B200: just update vLLM and you're good.
The article breaks down why softmax (not matmul) is now the bottleneck on Blackwell, how selective rescaling cuts the softmax correction work by roughly 10x, and the full 5-stage pipeline architecture.
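If you haven't seen online softmax before, the selective-rescaling idea is easy to sketch. Standard online softmax rescales the running accumulator by exp(m_old - m_new) on every key block; the selective variant skips that multiply whenever the running max didn't move. A minimal single-query NumPy version (my own illustration of the idea, not FA-4's actual kernel):

```python
import numpy as np

def online_softmax_attention(q, k, v, block=4):
    """One-query online (streaming) softmax attention with selective
    rescaling. Keys/values are consumed block by block while we keep a
    running max m, running normalizer l, and output accumulator acc."""
    d = q.shape[-1]
    m = -np.inf          # running max of logits seen so far
    l = 0.0              # running sum of exp(logit - m)
    acc = np.zeros(d)    # running weighted sum of values
    skipped = 0          # how many rescale corrections we avoided
    for start in range(0, k.shape[0], block):
        s = (k[start:start+block] @ q) / np.sqrt(d)  # this block's logits
        m_new = max(m, s.max())
        if m_new > m:                 # max moved: must rescale history
            corr = np.exp(m - m_new)  # exp(-inf) = 0 on the first block
            l *= corr
            acc *= corr
            m = m_new
        else:
            skipped += 1              # selective rescaling: corr would be 1
        p = np.exp(s - m)
        l += p.sum()
        acc += p @ v[start:start+block]
    return acc / l, skipped

rng = np.random.default_rng(1)
q = rng.standard_normal(8)
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
out, skipped = online_softmax_attention(q, k, v, block=4)
```

The result matches a dense softmax exactly whenever the skip condition holds, because the skipped correction factor is exactly 1.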
Also covers the Python angle: FA-4 is 100% CuTe-DSL (NVIDIA's Python kernel DSL). Compiles in 2.5 seconds vs 55 seconds for the C++ equivalent. Same runtime perf. That's a big deal for kernel iteration speed.
Paper: https://arxiv.org/abs/2603.05451
Article free link: https://medium.com/ai-advances/flashattention-4-python-gpu-kernel-blackwell-2b18f51c8b32?sk=59bca93c369143e5f74fb0f86e57e6d0
For those running local models:
The algorithmic ideas (selective rescaling, software-emulated exp) will likely trickle down to consumer GPUs eventually. The CuTe-DSL tooling is the real unlock for faster kernel development across the board.
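For the curious: "software-emulated exp" generally means range-reducing to 2^f with f in [0, 1), approximating 2^f with a short polynomial, and doing the power-of-two scaling with an exponent-field tweak (ldexp) instead of hitting the special-function unit. A hedged NumPy sketch with my own illustrative degree-3 coefficients, not FA-4's actual polynomial:

```python
import numpy as np

def exp2_poly(x):
    """Software 2**x: split x into integer part n and fraction f in [0, 1),
    approximate 2**f with a degree-3 polynomial (FMA chain in a real
    kernel), then scale by 2**n via ldexp, which only adjusts the float's
    exponent bits. Attention kernels use base 2 directly by pre-scaling
    logits with log2(e), so exp(x) = 2**(x * log2(e))."""
    n = np.floor(x)
    f = x - n                                        # f in [0, 1)
    # illustrative fit of 2**f on [0, 1); relative error ~1e-4
    p = 1.0 + f * (0.695557 + f * (0.226174 + f * 0.078025))
    return np.ldexp(p, n.astype(np.int32))           # p * 2**n
```

The polynomial degree is the accuracy/speed knob: attention softmax tolerates a few ULPs of error, which is why a short fit is enough.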
[link] [comments]
