https://preview.redd.it/3pjau5brllrg1.png?width=2501&format=png&auto=webp&s=181000a4046b8de02cc75c2a5c1776a3847ff34a
**Hardware:** AMD Ryzen 9 9900X | RX 9070 16GB VRAM (RDNA 4, gfx1201) | 192GB DDR5 | Ubuntu 24.04
**ROCm version:** 7.2.1
**llama.cpp build:** ROCm with `-DGGML_CUDA_FORCE_MMQ=ON -DGGML_HIP_GRAPHS=ON`

---

## TL;DR

ROCm 7.2.1 on the RX 9070 (RDNA4) beats Vulkan on prompt processing once you enable flash attention and the right build flags. Token generation still favors Vulkan on MoE models. The default ROCm build is catastrophically slow — flash attention alone gives a 5.5× improvement on prompt processing for dense models.

---

## The Discovery: Flash Attention Changes Everything

Testing ROCm out of the box was disappointing. Then I found the flags:

```bash
cmake .. -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1201 \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_PREFIX_PATH=/opt/rocm-7.2.1 \
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DGGML_HIP_GRAPHS=ON
# Run with --flash-attn
```

**Dense model (Qwen3-8B Q8_0) — prompt processing:**

- ROCm default, no flash attn: **711 t/s**
- ROCm + flash attn only: **~3,980 t/s**
- **5.5× improvement from one flag**

---

## Full Benchmark Results

### Qwen3.5-14B-A3B MXFP4 (MoE — 3B active params)

| Config | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| Vulkan (FA on) | 3,332 | **113.2** |
| ROCm default, no FA | 2,042 | 81.4 |
| **ROCm MMQ+GRAPHS+FA** | **3,731** | 87.6 |

**Verdict:** ROCm wins prompt processing (+12%), Vulkan wins token gen (+23% on MoE).

### Qwen3-8B Q8_0 (dense)

| Config | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| Vulkan | 3,336 | 68.1 |
| ROCm default, no FA | 711 | 60.6 |
| **ROCm MMQ+GRAPHS+FA** | **3,931** | 64.2 |

**Verdict:** ROCm wins prompt processing (+18%). Token gen roughly tied (Vulkan +6%).

### Context Scaling — Qwen3.5-14B-A3B MXFP4

| Context | Vulkan (t/s) | ROCm MMQ+FA (t/s) | Winner |
|---|---|---|---|
| pp512 | 3,184 | **3,731** | ROCm +17% |
| pp2048 | 3,537 | **3,770** | ROCm +7% |
| pp8192 | **3,280** | 3,191 | Vulkan +3% |

ROCm's prompt-processing advantage shrinks at long contexts; the two backends are roughly at parity by 8K.
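The pp512/tg128/pp2048/pp8192 figures above are llama-bench-style measurements. A sketch of how to reproduce them — the model filename and path are placeholders, not my exact files:

```shell
# Throughput benchmark with flash attention enabled (-fa 1).
# -p sweeps prompt-processing sizes (pp512/pp2048/pp8192),
# -n 128 gives the tg128 token-generation number,
# -ngl 99 offloads all layers to the GPU.
./build-rocm2/bin/llama-bench \
  -m ./models/Qwen3-8B-Q8_0.gguf \
  -p 512,2048,8192 \
  -n 128 \
  -fa 1 \
  -ngl 99
```

Run the same command against the Vulkan build directory for a like-for-like comparison; llama-bench prints a table with t/s per test.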
---

## What Didn't Work

These had no meaningful impact or caused crashes:

- `HSA_OVERRIDE_GFX_VERSION` — crashes or silently fails on gfx1201
- `HIP_FORCE_DEV_KERNELS` — no impact
- `HIPBLAS_V2` — no impact
- `GPU_MAX_WAVESPERCU` — no impact
- Smaller ubatch sizes — hurt prompt-processing performance

---

## Builds on My System

- `~/src/llama.cpp/build/` — Vulkan (stable, good token gen on MoE)
- `~/src/llama.cpp/build-rocm/` — ROCm default (don't use — the slow one)
- `~/src/llama.cpp/build-rocm2/` — **ROCm MMQ+GRAPHS (current production)**

Production runs on port 8081 with the ROCm MMQ+GRAPHS build, 262K context, and flash attention on.

---

## Notes on gfx1201 / RDNA4

This is one of the first published benchmark sets I've seen for the RX 9070 on ROCm 7.2.1. The RDNA4 kernels are new and still maturing — I'd expect ROCm token-generation performance to close the gap with Vulkan in future releases as gfx1201-specific optimizations land.

bitsandbytes does not support gfx1201 yet (HIP `invalid device function` error). If you need bitsandbytes-based quantization, stick with Vulkan or wait for the next bitsandbytes release.

---

## Hardware Context

The RX 9070 is paired with 192GB of DDR5. For MoE models that can't fit in 16GB of VRAM, the expert-offload path (`-ot "exps=CPU"`) gives strong results — the 122B Qwen model runs at 14 tok/s vs 4.2 tok/s all-CPU. That benchmark is in a separate post.

---

*Happy to answer questions or run specific benchmarks if useful.*
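Addendum: the expert-offload run mentioned under Hardware Context looks roughly like this. This is a sketch, not my exact command: the model filename is a placeholder, and `-ot`/`--override-tensor` takes tensor-name patterns (many people use a fuller pattern such as `".ffn_.*_exps.=CPU"`; flag spelling varies a bit between llama.cpp versions):

```shell
# Keep attention and shared weights in the 16GB of VRAM;
# park the MoE expert tensors in system RAM (192GB DDR5 here).
./build-rocm2/bin/llama-server \
  -m ./models/some-122b-moe.gguf \
  -ngl 99 \
  -ot "exps=CPU" \
  -c 262144 \
  --port 8081 \
  --flash-attn
# (newer llama.cpp builds spell the last flag "-fa on")
```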