https://preview.redd.it/3pjau5brllrg1.png?width=2501&format=png&auto=webp&s=181000a4046b8de02cc75c2a5c1776a3847ff34a
**Hardware:** AMD Ryzen 9 9900X | RX 9070 16GB VRAM (RDNA 4, gfx1201) | 192GB DDR5 | Ubuntu 24.04
**ROCm version:** 7.2.1
**llama.cpp build:** ROCm with `-DGGML_CUDA_FORCE_MMQ=ON -DGGML_HIP_GRAPHS=ON`

---

## TL;DR

ROCm 7.2.1 on the RX 9070 (RDNA4) beats Vulkan on prompt processing once you enable flash attention and the right build flags. Token generation still favors Vulkan on MoE models. The default ROCm build is catastrophically slow — flash attention alone gives a 5.5× improvement on prompt processing for dense models.

---

## The Discovery: Flash Attention Changes Everything

Testing ROCm out of the box was disappointing. Then I found the flags:

```bash
cmake .. -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1201 \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_PREFIX_PATH=/opt/rocm-7.2.1 \
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DGGML_HIP_GRAPHS=ON
# Run with --flash-attn
```

**Dense model (Qwen3-8B Q8_0) — prompt processing:**

- ROCm default, no flash attn: **711 t/s**
- ROCm + flash attn only: **~3,980 t/s**
- **5.5× improvement from one flag**

---

## Full Benchmark Results

### Qwen3.5-14B-A3B MXFP4 (MoE — 3B active params)

| Config | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| Vulkan (FA on) | 3,332 | **113.2** |
| ROCm default, no FA | 2,042 | 81.4 |
| **ROCm MMQ+GRAPHS+FA** | **3,731** | 87.6 |

**Verdict:** ROCm wins prompt processing (+12%), Vulkan wins token gen (+23% on MoE).

### Qwen3-8B Q8_0 (dense)

| Config | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| Vulkan | 3,336 | 68.1 |
| ROCm default, no FA | 711 | 60.6 |
| **ROCm MMQ+GRAPHS+FA** | **3,931** | 64.2 |

**Verdict:** ROCm wins prompt processing (+18%). Token gen roughly tied (Vulkan +6%).

### Context Scaling — Qwen3.5-14B-A3B MXFP4

| Context | Vulkan (t/s) | ROCm MMQ+FA (t/s) | Winner |
|---|---|---|---|
| pp512 | 3,184 | **3,731** | ROCm +17% |
| pp2048 | 3,537 | **3,770** | ROCm +7% |
| pp8192 | **3,280** | 3,191 | Vulkan +3% |

ROCm's prompt-processing advantage shrinks at long contexts; the two backends are roughly at parity by 8K.
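The pp512/tg128/pp2048/pp8192 figures above are llama-bench-style measurements. A sketch of how to reproduce them — the model filename and path are placeholders, not my exact files:

```shell
# Throughput benchmark with flash attention enabled (-fa 1).
# -p sweeps prompt-processing sizes (pp512/pp2048/pp8192),
# -n 128 gives the tg128 token-generation number,
# -ngl 99 offloads all layers to the GPU.
./build-rocm2/bin/llama-bench \
  -m ./models/Qwen3-8B-Q8_0.gguf \
  -p 512,2048,8192 \
  -n 128 \
  -fa 1 \
  -ngl 99
```

Run the same command against the Vulkan build directory for a like-for-like comparison; llama-bench prints a table with t/s per test.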
---

## What Didn't Work

These had no meaningful impact or caused crashes:

- `HSA_OVERRIDE_GFX_VERSION` — crashes or silently fails on gfx1201
- `HIP_FORCE_DEV_KERNELS` — no impact
- `HIPBLAS_V2` — no impact
- `GPU_MAX_WAVESPERCU` — no impact
- Smaller ubatch sizes — hurt prompt-processing performance

---

## Builds on My System

- `~/src/llama.cpp/build/` — Vulkan (stable, good token gen on MoE)
- `~/src/llama.cpp/build-rocm/` — ROCm default (don't use — the slow one)
- `~/src/llama.cpp/build-rocm2/` — **ROCm MMQ+GRAPHS (current production)**

Production runs on port 8081 with the ROCm MMQ+GRAPHS build, 262K context, and flash attention on.

---

## Notes on gfx1201 / RDNA4

This is one of the first published benchmark sets I've seen for the RX 9070 on ROCm 7.2.1. The RDNA4 kernels are new and still maturing — I'd expect ROCm token-generation performance to close the gap with Vulkan in future releases as gfx1201-specific optimizations land.

bitsandbytes does not support gfx1201 yet (HIP `invalid device function` error). If you need bitsandbytes-based quantization, stick with Vulkan or wait for the next bitsandbytes release.

---

## Hardware Context

The RX 9070 is paired with 192GB of DDR5. For MoE models that can't fit in 16GB of VRAM, the expert-offload path (`-ot "exps=CPU"`) gives strong results — the 122B Qwen model runs at 14 tok/s vs 4.2 tok/s all-CPU. That benchmark is in a separate post.

---

*Happy to answer questions or run specific benchmarks if useful.*
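Addendum: the expert-offload run mentioned under Hardware Context looks roughly like this. This is a sketch, not my exact command: the model filename is a placeholder, and `-ot`/`--override-tensor` takes tensor-name patterns (many people use a fuller pattern such as `".ffn_.*_exps.=CPU"`; flag spelling varies a bit between llama.cpp versions):

```shell
# Keep attention and shared weights in the 16GB of VRAM;
# park the MoE expert tensors in system RAM (192GB DDR5 here).
./build-rocm2/bin/llama-server \
  -m ./models/some-122b-moe.gguf \
  -ngl 99 \
  -ot "exps=CPU" \
  -c 262144 \
  --port 8081 \
  --flash-attn
# (newer llama.cpp builds spell the last flag "-fa on")
```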