I've been using a couple of 32GB MI50s with my setup for the past 9 months. Most of my use cases just rely on llama.cpp and it works like a charm now! (A huge leap compared to how things were back then.) I would occasionally also dabble with ComfyUI to try out the new ImageGen/AudioGen models just for the fun of it. But one specific use case that was never practically feasible with MI50s for me was video generation.

**The problem**

I remember my previous encounters with Wan 2.2 where simple video generations would either OOM right away or take an insane 7-9 hours before I just gave up and killed the process myself. I had no luck with the latest LTX models either. With a bit of research, I found that MI50s (gfx906) have zero memory-efficient attention support in PyTorch because they lack the matrix-multiplication cores for it. Every single fused attention implementation excludes gfx906: Composable Kernel, AOTriton, the ROCm Flash Attention build, and Triton's fused attention all either require newer instruction sets (gfx908+) or explicitly leave gfx906 out.
Without fused attention, PyTorch falls back to Math SDPA, which materializes the full N×N attention score matrix. For a 2.5-second 480p video (17K tokens), that's 26 GB just for one attention layer's score matrix. For a 5-second 720p video (75K tokens), it's over 500 GB. Completely impossible on 32 GB.

**The DIY approach**

Naturally, after the above findings, I was curious how llama.cpp handles this for my GPU even though it lacks official FA support. It turns out they have a generic tiling mechanism in place as a fallback for unsupported GPUs. With this as my inspiration, I decided to see if I could build something similar for PyTorch myself. Though this realm of coding is completely new to me, I was able to navigate it with AI assistance.

The core idea is simple: instead of computing the full N×N score matrix at once, tile it into chunks that fit in memory. Though simple in theory, getting it to actually work reliably took about 28 iterations. Some of the things I had to figure out:

What worked:
What didn't work or wasn't needed:
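To make the tiling idea concrete, here is a minimal sketch of chunked attention using the online-softmax rescaling trick, written with NumPy standing in for PyTorch matmuls so it runs anywhere. It illustrates the general technique, not the author's actual kernel:

```python
import numpy as np

def naive_sdpa(q, k, v):
    """Math-SDPA style attention: materializes the full N x N score matrix."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.T) * scale                      # (N, N): the memory hog
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def chunked_sdpa(q, k, v, q_chunk=128, k_chunk=128):
    """Tiled attention: only a (q_chunk, k_chunk) score tile exists at once.

    Peak score memory drops from N*N elements to q_chunk*k_chunk, at the
    cost of re-normalizing partial results with the online-softmax trick.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty_like(q)
    for qs in range(0, n, q_chunk):
        qe = min(qs + q_chunk, n)
        qb = q[qs:qe]
        m = np.full((qe - qs, 1), -np.inf)   # running row max
        l = np.zeros((qe - qs, 1))           # running softmax denominator
        acc = np.zeros((qe - qs, d))         # running weighted sum of V
        for ks in range(0, n, k_chunk):
            ke = min(ks + k_chunk, n)
            s = (qb @ k[ks:ke].T) * scale    # small tile, never N x N
            m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
            p = np.exp(s - m_new)
            corr = np.exp(m - m_new)         # rescale earlier partials
            l = l * corr + p.sum(axis=-1, keepdims=True)
            acc = acc * corr + p @ v[ks:ke]
            m = m_new
        out[qs:qe] = acc / l
    return out
```

Both functions return the same values; only the peak memory differs. At 17K tokens in FP16, the naive score matrix alone is 17,000² × 2 bytes ≈ 0.6 GB per head, and multiplying across heads is where the tens-of-GB figures above come from.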
**Where it landed**

The kernel works and makes the following possible on a single MI50 32GB:

Video Generation (via ComfyUI):
Image Generation (Z-Image Turbo 6B via ComfyUI):
PyTorch LLM Inference — Qwen 2.5 0.5B (GQA, FP16):
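A note on the GQA part: Qwen 2.5 uses grouped-query attention, so a drop-in SDPA replacement has to cope with fewer KV heads than query heads. A common way to handle this (a generic sketch, not necessarily what the library does) is to repeat each KV head across its query group so the per-head matmuls line up:

```python
import numpy as np

def expand_kv_for_gqa(kv, num_q_heads):
    """Repeat KV heads so each query head gets a matching KV head.

    kv: (num_kv_heads, seq_len, head_dim) -> (num_q_heads, seq_len, head_dim)
    Mirrors torch.repeat_interleave along the head axis.
    """
    num_kv_heads = kv.shape[0]
    assert num_q_heads % num_kv_heads == 0, "head counts must divide evenly"
    group_size = num_q_heads // num_kv_heads
    return np.repeat(kv, group_size, axis=0)

# Example: 2 KV heads serving 8 query heads (group size 4).
kv = np.stack([np.zeros((5, 8)), np.ones((5, 8))])
expanded = expand_kv_for_gqa(kv, 8)
print(expanded.shape)  # (8, 5, 8)
```

For Qwen 2.5 0.5B this would map its 2 KV heads onto 14 query heads, a group size of 7.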
All benchmarks were run at a 150 W power limit on a single MI50 32GB with 128 GB DDR4 RAM. Important note on DRAM: these VideoGen workflows rely on CPU offloading, and you would need at least 64 GB of DRAM to comfortably experiment with various resolutions and video lengths. (Workflows used for Wan 2.2 5B and LTX 2.3 are shared in my Git repo for reference.)

**Also, have you noticed something?! It's actually faster too!**

The best part about the kernel is that it outperforms Math SDPA even at sequence lengths where Math SDPA can still run. Isolated attention benchmarks (B=1, H=16, D=64, FP16 on MI50):
The speedup likely comes from better L2 cache utilization: smaller chunks stay hot in cache instead of thrashing through a massive N×N matrix. This is a fundamental property of tiled attention (the same reason Flash Attention is faster on NVIDIA), so the direction should hold on other GPUs even if the exact numbers differ. To me, this made the kernel a perfect drop-in replacement for anything PyTorch!

**Other areas where this could be useful**

The benchmarks above are just what I've personally tested, but the kernel patches all SDPA calls globally, so it's not limited to ComfyUI or inference. In theory it should also help with:
**From gfx906 to a broader release**

Originally this was just a simple private DIY project for my MI50. I had no plans of releasing it. But then I realized the algorithm is pure PyTorch matmuls. Every AMD GPU without fused attention has the exact same problem:
That's a huge installed base of GPUs currently stuck on Math SDPA for attention-heavy workloads. So I packaged it as a generic, pip-installable library with automatic GPU detection. On supported GPUs, one import is all it takes.

The detection system probes for efficient SDPA backends at startup. If your GPU has Flash Attention or mem_efficient support, it stays out of the way. If not, it activates automatically.

Repo: https://github.com/Lowkey-Loki-SN/noflash-attention

**Limitations and contributions welcome**

I want to be upfront about the following:
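To illustrate the "probe, then patch" behavior described in the release section, here's a generic, hypothetical sketch. The names are made up; on a real install the probe would query PyTorch's SDPA backends and the patch would wrap `torch.nn.functional.scaled_dot_product_attention`:

```python
class FakeFunctional:
    """Toy stand-in for torch.nn.functional, exposing only a math-style SDPA."""
    @staticmethod
    def scaled_dot_product_attention(q, k, v):
        return "math-sdpa"  # slow fallback that materializes N x N scores

def has_efficient_backend():
    # Stand-in for probing flash / mem_efficient support at import time.
    # Hard-coded False to mimic a gfx906-class GPU.
    return False

def tiled_sdpa(q, k, v):
    return "tiled-sdpa"  # the chunked replacement kernel

def activate(functional):
    # Patch only when no efficient backend exists, so the library
    # "stays out of the way" on GPUs that already have flash attention.
    if not has_efficient_backend():
        functional.scaled_dot_product_attention = tiled_sdpa

activate(FakeFunctional)
print(FakeFunctional.scaled_dot_product_attention(None, None, None))  # tiled-sdpa
```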
If you have any of the above GPUs that would benefit from the kernel and want to try it out, I'd love to hear about your results! This is a side project, so I can't promise continued commitment to refining it further, but bug reports and compatibility feedback are welcome. Let the community do its thing!

**Bonus Fact: ROCm 7.2 + PyTorch from source works with gfx906**

Along the way, I also wanted to test whether ROCm 7.2 could work on gfx906 (it's not officially supported). The answer is yes, if you build from source. I compiled ROCm 7.2 and then built PyTorch against it, and gfx906 still works! The hardware support in the compiler (LLVM/AMDGPU) hasn't been removed; it's just not in the official build targets. I've been using it for a week and it's stable so far.

I'mma end this with a 1080p 5-second audio-video clip generated with LTX-2.3 22B using this kernel on a single MI50! [link]
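Circling back to the from-source build: the key is keeping gfx906 in the architecture list when compiling PyTorch. A rough sketch of the PyTorch side only (paths and flags are assumptions; adjust for your ROCm install and PyTorch version):

```shell
# Sketch: build PyTorch against a local ROCm with gfx906 kept as a target.
export ROCM_PATH=/opt/rocm            # assumed ROCm install prefix
export PYTORCH_ROCM_ARCH=gfx906       # keep gfx906 in the build targets
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
python tools/amd_build/build_amd.py   # HIPify the CUDA sources
USE_ROCM=1 python setup.py develop
```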
Built a simple PyTorch flash-attention alternative for AMD GPUs that don't have it
Reddit r/LocalLLaMA / 3/28/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage
Key Points
- The author describes difficulty running video-generation models on AMD MI50 (gfx906) GPUs because PyTorch lacks memory-efficient/flash attention support for that architecture, causing severe memory blow-ups or extremely slow runs.
- They explain that common fused attention approaches (Composable Kernel, AOTriton, Flash Attention ROCm, and Triton) either require newer GPU instruction sets (gfx908+) or explicitly exclude gfx906.
- Without fused attention, PyTorch falls back to math SDPA that materializes the full N×N attention matrix, making longer/higher-resolution video prompts infeasible within 32GB VRAM.
- Drawing inspiration from llama.cpp’s tiling fallback for unsupported GPUs, they built a “simple PyTorch flash-attention alternative” that computes attention in memory-fitting tiles by processing query/key chunks instead of the full matrix.