I have a 4x R9700 system on a Threadripper Pro, but I have never been happy with the performance of my GPUs in vLLM. I have started benchmarking every new model I try with llama-benchy so that I can get a better idea of how models of different sizes and architectures compare on my system. With every model I have tested, I run into a wall around 64k tokens of context: TTFT, TG, and PP all fall on their face at long context lengths.

So this past weekend I rented an MI300X from RunPod, thinking that AMD must have this issue sorted on CDNA. When loading up vLLM with Qwen3.6-27B-FP8, I noticed that vLLM was selecting ROCm Attention instead of one of the AITER attention backends, which I thought was strange, but I pushed on with my benchmarking runs. After a run of llama-benchy I saw that the MI300X had the same issue my R9700s do at long context lengths: at >64k context, my TG/s would fall to single digits. This prompted me to go searching for an AMD runbook on running vLLM on the MI300X, where I found that the AITER attention mechanisms are gated behind an env var that you have to explicitly enable.

With this newfound information, I went back to patching vLLM and AITER support for gfx1201. I already have a patched version of vLLM that I build to bring FP8 support to the R9700, built on top of the AITER Triton kernels. I had some issues when I was first patching in AITER support, so I disabled everything but the Triton kernels in order to get FP8 working. Most of the patching for AITER and vLLM just requires removing gates that block gfx1201, or adding that architecture wherever you see MI350X (my understanding is that the MI350X and RDNA4 implement FP8 the same way, or similarly enough that you can use some of the MI350X kernels on RDNA4). All of my testing was done with Qwen3.6 27B, since this model finally gives us close-to-SOTA performance at home.
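For reference, the opt-in gate mentioned above is vLLM's `VLLM_ROCM_USE_AITER` environment variable, which defaults to off. A minimal sketch of enabling it from Python before vLLM initializes (in practice you would more likely set it in the shell, e.g. `VLLM_ROCM_USE_AITER=1 vllm serve ...`):

```python
import os

# AITER attention backends on ROCm are opt-in; this must be set before
# vLLM performs its platform/backend selection, i.e. before importing vllm.
os.environ["VLLM_ROCM_USE_AITER"] = "1"

print(os.environ["VLLM_ROCM_USE_AITER"])  # "1"
```

Note that additional per-feature AITER toggles exist in vLLM; check the environment variable list for your vLLM version, since the exact set changes between releases.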
Because Qwen3.6 is a hybrid architecture, it kept crashing AITER Unified Attention due to a mismatch in the expected TILE_SIZE; AITER only supports KV block sizes that are a power of two.

The main downside I have found so far, if you can call it that, is that you can only run an FP16/BF16 KV cache. Not that you would need to quantize your cache with the Qwen3.6 family, since its cache footprint is already tiny, but it is something to be aware of if you decide to try it out.

I have attached some of my benchmark runs of Qwen3.6 on my R9700s and the MI300X I rented. I have not been able to rent an MI300X from RunPod again to test with AITER Attention, since there has been no availability the past few days. I'm sorry that there is no pre-AITER benchmark; I seem to have overwritten it while I was troubleshooting. I do have my original benchmarks from Qwen3.6 35B, which I will attach. I have also attached a benchmark with MTP enabled and set to 3 tokens; as far as I can tell, at single concurrency it is free performance. At concurrency 2, TG performance drops off sharply at high context depths. The llama-benchy runs are TG128 and PP2048 at each of the context depths.
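The TILE_SIZE crash comes down to AITER's kernels requiring power-of-two KV block sizes. A hedged sketch of that constraint — the bit-arithmetic check is standard, but the function names and the guard itself are illustrative, not vLLM's or AITER's actual identifiers:

```python
def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two (16, 32, 64, ...)."""
    return n > 0 and (n & (n - 1)) == 0

def validate_kv_block_size(block_size: int) -> int:
    # Illustrative guard: reject KV block sizes the kernels cannot tile.
    if not is_power_of_two(block_size):
        raise ValueError(
            f"AITER Unified Attention needs a power-of-two KV block size, got {block_size}"
        )
    return block_size

print(validate_kv_block_size(64))   # 64 passes
# validate_kv_block_size(48)        # would raise ValueError
```

If a hybrid-architecture model ends up with a non-power-of-two block size for its attention layers, a check like this (or the kernel's own tiling math) is what fails.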
For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention.
Reddit r/LocalLLaMA / 4/28/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage
Key Points
- The author reports poor vLLM performance on both Radeon R9700 (4-GPU Threadripper Pro setup) and an AMD MI300X when running long-context workloads, with throughput collapsing beyond ~64k tokens.
- They discovered that AITER attention backends in vLLM/ROCm are disabled behind an environment-variable gate, and that enabling/patching AITER support is required to avoid the long-context bottleneck.
- After patching vLLM to add gfx1201 support for AITER Unified Attention, their Qwen3.6-27B-FP8 tests on the R9700s show improved long-context behavior; the MI300X could not be re-tested due to RunPod availability.
- They note a key limitation: the approach currently supports only FP16/BF16 KV cache (no FP8 KV cache), and Qwen3.6’s hybrid architecture can crash Unified Attention due to TILE_SIZE/KV block-size constraints.
- Overall, the post is targeted at a small audience of users running vLLM across multiple AMD GPUs and suggests specific code changes to remove gfx1201 gating and reuse similar kernels from MI350X/RDNA4 implementations.
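A hedged sketch of the kind of architecture allow-list gate these patches touch. The set contents, function name, and placement are illustrative only; the real gates in vLLM and AITER are scattered across several files, but the pattern is the same:

```python
# Illustrative allow-list gate of the sort the post describes patching.
AITER_SUPPORTED_ARCHS = {"gfx942", "gfx950"}  # e.g. MI300X, MI350X (illustrative)

def aiter_enabled_for(gcn_arch: str) -> bool:
    # ROCm reports arch strings like "gfx942:sramecc+:xnack-"; compare the base name.
    return gcn_arch.split(":")[0] in AITER_SUPPORTED_ARCHS

# The patch amounts to adding RDNA4 alongside MI350X, on the premise that
# their FP8 implementations are close enough to share kernels:
AITER_SUPPORTED_ARCHS.add("gfx1201")

print(aiter_enabled_for("gfx1201:sramecc+:xnack-"))  # True after the patch
```

The actual arch string on a given system can be read from `torch.cuda.get_device_properties(0).gcnArchName` under ROCm builds of PyTorch.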