I've been using a couple of 32GB MI50s with my setup for the past 9 months. Most of my use cases just rely on llama.cpp and it works like a charm now! (A huge leap compared to how things were back then.) I would occasionally also dabble with ComfyUI to try out the new ImageGen/AudioGen models just for the fun of it. But one specific use case that was never practically feasible with MI50s for me was video generation.

**The problem**

I remember my previous encounters with Wan 2.2 where simple video generations would either OOM right away or take an insane 7-9 hours before I just gave up and killed the process myself. I had no luck with the latest LTX models either. With a bit of research, I found that MI50s (gfx906) have zero memory-efficient attention support in PyTorch because they lack the matrix-multiplication cores for it. Every single fused attention implementation excludes gfx906: Composable Kernel, AOTriton, the ROCm Flash Attention build, and Triton's fused attention all either require newer instruction sets (gfx908+) or explicitly leave gfx906 out.
Without fused attention, PyTorch falls back to Math SDPA, which materializes the full N×N attention score matrix. For a 2.5-second 480p video (17K tokens), that's 26 GB just for one attention layer's score matrix. For a 5-second 720p video (75K tokens), it's over 500 GB. Completely impossible on 32 GB.

**The DIY approach**

Naturally, after the above findings, I was curious how llama.cpp handles this for my GPU even though it lacks official FA support. It turns out they have a generic tiling mechanism in place as a fallback for unsupported GPUs. With this as my inspiration, I decided to see if I could build something similar for PyTorch myself. Though this realm of coding is completely new to me, I was able to navigate it with AI assistance.

The core idea is simple: instead of computing the full N×N score matrix at once, tile it into chunks that fit in memory. Though simple in theory, getting it to actually work reliably took about 28 iterations. Some of the things I had to figure out:

What worked:
What didn't work or wasn't needed:
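To make the tiling idea concrete, here is a minimal sketch of chunked attention using the online-softmax rescaling trick, written with NumPy standing in for PyTorch matmuls so it runs anywhere. It illustrates the general technique, not the author's actual kernel:

```python
import numpy as np

def naive_sdpa(q, k, v):
    """Math-SDPA style attention: materializes the full N x N score matrix."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.T) * scale                      # (N, N): the memory hog
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def chunked_sdpa(q, k, v, q_chunk=128, k_chunk=128):
    """Tiled attention: only a (q_chunk, k_chunk) score tile exists at once.

    Peak score memory drops from N*N elements to q_chunk*k_chunk, at the
    cost of re-normalizing partial results with the online-softmax trick.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty_like(q)
    for qs in range(0, n, q_chunk):
        qe = min(qs + q_chunk, n)
        qb = q[qs:qe]
        m = np.full((qe - qs, 1), -np.inf)   # running row max
        l = np.zeros((qe - qs, 1))           # running softmax denominator
        acc = np.zeros((qe - qs, d))         # running weighted sum of V
        for ks in range(0, n, k_chunk):
            ke = min(ks + k_chunk, n)
            s = (qb @ k[ks:ke].T) * scale    # small tile, never N x N
            m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
            p = np.exp(s - m_new)
            corr = np.exp(m - m_new)         # rescale earlier partials
            l = l * corr + p.sum(axis=-1, keepdims=True)
            acc = acc * corr + p @ v[ks:ke]
            m = m_new
        out[qs:qe] = acc / l
    return out
```

Both functions return the same values; only the peak memory differs. At 17K tokens in FP16, the naive score matrix alone is 17,000² × 2 bytes ≈ 0.6 GB per head, and multiplying across heads is where the tens-of-GB figures above come from.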
**Where it landed**

The kernel works and makes the following possible on a single MI50 32GB:

Video Generation (via ComfyUI):
Image Generation (Z-Image Turbo 6B via ComfyUI):
PyTorch LLM Inference — Qwen 2.5 0.5B (GQA, FP16):
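A note on the GQA part: Qwen 2.5 uses grouped-query attention, so a drop-in SDPA replacement has to cope with fewer KV heads than query heads. A common way to handle this (a generic sketch, not necessarily what the library does) is to repeat each KV head across its query group so the per-head matmuls line up:

```python
import numpy as np

def expand_kv_for_gqa(kv, num_q_heads):
    """Repeat KV heads so each query head gets a matching KV head.

    kv: (num_kv_heads, seq_len, head_dim) -> (num_q_heads, seq_len, head_dim)
    Mirrors torch.repeat_interleave along the head axis.
    """
    num_kv_heads = kv.shape[0]
    assert num_q_heads % num_kv_heads == 0, "head counts must divide evenly"
    group_size = num_q_heads // num_kv_heads
    return np.repeat(kv, group_size, axis=0)

# Example: 2 KV heads serving 8 query heads (group size 4).
kv = np.stack([np.zeros((5, 8)), np.ones((5, 8))])
expanded = expand_kv_for_gqa(kv, 8)
print(expanded.shape)  # (8, 5, 8)
```

For Qwen 2.5 0.5B this would map its 2 KV heads onto 14 query heads, a group size of 7.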
All benchmarks were run at a 150 W power limit on a single MI50 32GB with 128 GB DDR4 RAM. Important note on DRAM: these VideoGen workflows rely on CPU offloading, and you would need at least 64 GB of DRAM to comfortably experiment with various resolutions and video lengths. (Workflows used for Wan 2.2 5B and LTX 2.3 are shared in my Git repo for reference.)

**Also, have you noticed something?! It's actually faster too!**

The best part about the kernel is that it outperforms Math SDPA even at sequence lengths where Math SDPA can still run. Isolated attention benchmarks (B=1, H=16, D=64, FP16 on MI50):
The speedup likely comes from better L2 cache utilization: smaller chunks stay hot in cache instead of thrashing through a massive N×N matrix. This is a fundamental property of tiled attention (the same reason Flash Attention is faster on NVIDIA), so the direction should hold on other GPUs even if the exact numbers differ. To me, this made the kernel a perfect drop-in replacement for anything PyTorch!

**Other areas where this could be useful**

The benchmarks above are just what I've personally tested, but the kernel patches all SDPA calls globally, so it's not limited to ComfyUI or inference. In theory it should also help with:
**From gfx906 to a broader release**

Originally this was just a simple private DIY project for my MI50. I had no plans of releasing it. But then I realized the algorithm is pure PyTorch matmuls. Every AMD GPU without fused attention has the exact same problem:
That's a huge installed base of GPUs currently stuck on Math SDPA for attention-heavy workloads. So I packaged it as a generic, pip-installable library with automatic GPU detection. On supported GPUs, one import is all it takes.

The detection system probes for efficient SDPA backends at startup. If your GPU has Flash Attention or mem_efficient support, it stays out of the way. If not, it activates automatically.

Repo: https://github.com/Lowkey-Loki-SN/noflash-attention

**Limitations and contributions welcome**

I want to be upfront about the following:
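To illustrate the "probe, then patch" behavior described in the release section, here's a generic, hypothetical sketch. The names are made up; on a real install the probe would query PyTorch's SDPA backends and the patch would wrap `torch.nn.functional.scaled_dot_product_attention`:

```python
class FakeFunctional:
    """Toy stand-in for torch.nn.functional, exposing only a math-style SDPA."""
    @staticmethod
    def scaled_dot_product_attention(q, k, v):
        return "math-sdpa"  # slow fallback that materializes N x N scores

def has_efficient_backend():
    # Stand-in for probing flash / mem_efficient support at import time.
    # Hard-coded False to mimic a gfx906-class GPU.
    return False

def tiled_sdpa(q, k, v):
    return "tiled-sdpa"  # the chunked replacement kernel

def activate(functional):
    # Patch only when no efficient backend exists, so the library
    # "stays out of the way" on GPUs that already have flash attention.
    if not has_efficient_backend():
        functional.scaled_dot_product_attention = tiled_sdpa

activate(FakeFunctional)
print(FakeFunctional.scaled_dot_product_attention(None, None, None))  # tiled-sdpa
```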
If you have any of the above GPUs that would benefit from the kernel and want to try it out, I'd love to hear about your results! This is a side project, so I can't promise continued commitment to refining it further, but bug reports and compatibility feedback are welcome. Let the community do its thing!

**Bonus Fact: ROCm 7.2 + PyTorch from source works with gfx906**

Along the way, I also wanted to test whether ROCm 7.2 could work on gfx906 (it's not officially supported). The answer is yes, if you build from source. I compiled ROCm 7.2 and then built PyTorch against it, and gfx906 still works! The hardware support in the compiler (LLVM/AMDGPU) hasn't been removed; it's just not in the official build targets. I've been using it for a week and it's stable so far.

I'mma end this with a 1080p 5-second audio-video clip generated with LTX-2.3 22B using this kernel on a single MI50! [link]
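Circling back to the from-source build: the key is keeping gfx906 in the architecture list when compiling PyTorch. A rough sketch of the PyTorch side only (paths and flags are assumptions; adjust for your ROCm install and PyTorch version):

```shell
# Sketch: build PyTorch against a local ROCm with gfx906 kept as a target.
export ROCM_PATH=/opt/rocm            # assumed ROCm install prefix
export PYTORCH_ROCM_ARCH=gfx906       # keep gfx906 in the build targets
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
python tools/amd_build/build_amd.py   # HIPify the CUDA sources
USE_ROCM=1 python setup.py develop
```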
Built a simple PyTorch flash-attention alternative for AMD GPUs that don't have it
Reddit r/LocalLLaMA / 3/28/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage
Key Points
- The author describes difficulty running video-generation models on AMD MI50 (gfx906) GPUs because PyTorch lacks memory-efficient/flash attention support for that architecture, causing severe memory blow-ups or extremely slow runs.
- They explain that common fused attention approaches (Composable Kernel, AOTriton, Flash Attention ROCm, and Triton) either require newer GPU instruction sets (gfx908+) or explicitly exclude gfx906.
- Without fused attention, PyTorch falls back to math SDPA that materializes the full N×N attention matrix, making longer/higher-resolution video prompts infeasible within 32GB VRAM.
- Drawing inspiration from llama.cpp’s tiling fallback for unsupported GPUs, they built a “simple PyTorch flash-attention alternative” that computes attention in memory-fitting tiles by processing query/key chunks instead of the full matrix.