Dual RX 7900 XTX — Qwen3.5-35B-A3B Inference Benchmark
Date: 2026-03-27
Hardware: 2x AMD Radeon RX 7900 XTX (48 GB VRAM total, 384-bit GDDR6 per card)
CPU: AMD Ryzen 9 5900XT (16C/32T), 64 GB DDR4
OS: Ubuntu 24.04.4 LTS, kernel 6.17.0-1012-oem
Backend: Vulkan (RADV NAVI31, Mesa), llama.cpp build b8516
Model: Huihui-Qwen3.5-35B-A3B-abliterated.Q4_K_M.gguf (19.71 GiB, 34.66B params, ~3B active)
Split: layer split across both GPUs (-ngl 99, default split)
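For anyone reproducing this, the llama-bench runs were roughly of this shape (a sketch: the model path is illustrative, and the prompt-length sweep and repetition counts are taken from the tables below):

```shell
# Sketch of the llama-bench invocations (Vulkan build of llama.cpp).
# Adjust MODEL to wherever the GGUF actually lives.
MODEL=./Huihui-Qwen3.5-35B-A3B-abliterated.Q4_K_M.gguf

# Token generation: tg128, 3 repetitions, all layers offloaded (-ngl 99,
# default layer split across both GPUs)
./llama-bench -m "$MODEL" -ngl 99 -p 0 -n 128 -r 3

# Prompt processing sweep: pp1..pp2048, 2 repetitions, no generation
./llama-bench -m "$MODEL" -ngl 99 -p 1,16,64,256,512,1024,2048 -n 0 -r 2
```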
TOKEN GENERATION (llama-bench, 3 repetitions)
| Test | tok/s |
|---|---|
| tg128 | 123.08±0.14 |
PROMPT PROCESSING (llama-bench, 2 repetitions)
| Test | tok/s |
|---|---|
| pp1 | 118.46±0.45 |
| pp16 | 325.08±1.98 |
| pp64 | 833.12±28.4 |
| pp256 | 1945.28±1.04 |
| pp512 | 2647.13±13.21 |
| pp1024 | 3181.31±305 |
| pp2048 | 3822.73±30.9 |
COMPARISON WITH PUBLISHED BENCHMARKS (same model: Qwen3.5-35B-A3B)
Sources: [1] HuggingFace ubergarm/Qwen3.5-35B-A3B-GGUF/discussions/1 [2] llama.cpp Discussion #10879 (Vulkan performance) [3] llama.cpp Discussion #15021 (ROCm/HIP performance) [4] llama.cpp Discussion #19890 (RDNA4 R9700 vs RTX 5090) [5] InsiderLLM Qwen3.5 local guide [6] Level1Techs dual 7900 XTX thread
TOKEN GENERATION — Qwen3.5-35B-A3B (or similar MoE 30-35B A3B)
| GPU | Backend | Quant | TG tok/s | Source |
|---|---|---|---|---|
| Dual 7900 XTX | HIP | Q4_0 | 47 | [1] |
| Single 7900 XTX | HIP | Q4_0 | 76-78 | [1] |
| Single 7900 XTX | Vulkan | Q4_0 | 95-105 | [1] |
| Single W7900 | Vulkan | Q8_0 | ~48 | [6] |
| RTX 3090 | CUDA | Q4_K_M | 111 | [5] |
| RTX 5090 | CUDA | Q4_K_M | 165 | [5] |
| Radeon AI PRO R9700 | Vulkan | Q4_K_XL | 127 | [4] |
| >>> Dual 7900 XTX | Vulkan | Q4_K_M | 123 | This |
PROMPT PROCESSING — Qwen3.5-35B-A3B
| GPU | Backend | Quant | PP512 tok/s | Source |
|---|---|---|---|---|
| Dual 7900 XTX | HIP | Q4_0 | 1,090-1,355 | [1] |
| Single 7900 XTX | HIP | Q4_0 | 1,153-2,237 | [1] |
| Single 7900 XTX | Vulkan | Q4_0 | 2,105-2,472 | [1] |
| >>> Dual 7900 XTX | Vulkan | Q4_K_M | 2,647 | This |
CROSS-MODEL REFERENCE (Llama 2 7B Q4_0 — standard benchmark)
| GPU | Backend | PP512 tok/s | TG128 tok/s | Source |
|---|---|---|---|---|
| Single 7900 XTX | HIP+FA | 3,874 | 170 | [3] |
| Single 7900 XTX | Vulkan | 3,532 | 191 | [2] |
| Dual 7900 XTX | HIP | 330 (70B) | 13.4 (70B) | [3] |
vLLM COMPARISON (same hardware, same model)
We also tested vLLM 0.17.1rc1 with ROCm 7.0 on the same dual 7900 XTX setup.
| Framework | Backend | Model | TG tok/s | PP tok/s | Status |
|---|---|---|---|---|---|
| vLLM 0.17.1 | ROCm/HIP | Qwen3.5-35B Q4_K_M | 5 | N/A | Broken output |
| vLLM 0.17.1 | ROCm/HIP | Qwen3.5-35B FP16 | OOM | N/A | Does not load |
| vLLM 0.17.1 | ROCm+FP8 MoE | Qwen3.5-35B | OOM→33.7GB | N/A | MI300X only |
| llama.cpp | HIP+graphs | Qwen3.5-35B Q4_K_M | 86.66 | ~1,345 | Working |
| llama.cpp | Vulkan | Qwen3.5-35B Q4_K_M | 123.08 | 3,823 | Working |
Notes on vLLM:
- vLLM's GGUF MoE quantization path produced multi-language garbage output (random Chinese, Korean, and Spanish tokens) at ~5 tok/s on gfx1100. The same GGUF file produces coherent output on llama.cpp.
- vLLM's FP8 MoE quantization (--quantization fp8) reduced VRAM from 60 GB to 33.7 GB but only works on MI300X (CDNA3), not gfx1100 (RDNA3).
- The AITER MoE kernel-fusion library (VLLM_ROCM_USE_AITER_MOE=1) is MI300X-only and will not compile on RDNA3.
- vLLM's Triton kernels are not optimized for RDNA3's wave32 architecture.
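For reference, the vLLM attempts were roughly of this shape (a sketch: the non-GGUF model identifier is illustrative, and the FP8 and AITER paths fail on gfx1100 exactly as described in the notes above):

```shell
# Sketch of the vLLM invocations tried on the same dual 7900 XTX box (ROCm build).
# Expect broken output / failures on RDNA3 as documented above.

# GGUF MoE path: loads, but produced multi-language garbage at ~5 tok/s on gfx1100
vllm serve ./Huihui-Qwen3.5-35B-A3B-abliterated.Q4_K_M.gguf \
    --tensor-parallel-size 2

# FP8 MoE quantization: shrinks weights to ~33.7 GB, but the kernel path is
# MI300X (CDNA3) only; model identifier here is illustrative
vllm serve <qwen3.5-35b-a3b-model> --quantization fp8 --tensor-parallel-size 2

# AITER MoE kernel fusion: MI300X-only, will not compile on RDNA3
VLLM_ROCM_USE_AITER_MOE=1 vllm serve <qwen3.5-35b-a3b-model>
```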
Bottom line: vLLM is not viable for MoE inference on RX 7900 XTX. llama.cpp Vulkan delivers 24.6x the token generation speed (123 vs 5 tok/s).
KEY OBSERVATIONS
Vulkan outperforms HIP/ROCm on RDNA3 for MoE workloads.
- TG: 123 tok/s (Vulkan) vs 47 tok/s (dual HIP) = 2.6x faster
- This contradicts the common recommendation to use ROCm over Vulkan on AMD GPUs. For MoE models with small active parameter counts, Vulkan's GEMV path achieves higher thread utilization on the small-K expert matrices.
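A back-of-envelope check also shows there is still bandwidth headroom: at ~3B active parameters per token, 123 tok/s reads well under a single card's ~960 GB/s GDDR6 peak (a rough sketch; the 3B active figure is approximate, and attention/KV-cache and router traffic are ignored):

```python
# Back-of-envelope: is 123 tok/s close to memory-bandwidth bound?
GIB = 1 << 30
model_bytes = 19.71 * GIB        # Q4_K_M file size from llama-bench
total_params = 34.66e9
active_params = 3.0e9            # approximate active params per token (MoE)

bits_per_param = model_bytes * 8 / total_params
bytes_per_token = active_params * bits_per_param / 8
effective_bw = bytes_per_token * 123.08 / 1e9   # GB of weights read per second

print(f"{bits_per_param:.2f} bits/param")       # ~4.88
print(f"{bytes_per_token / 1e9:.2f} GB/token")  # ~1.83
print(f"{effective_bw:.0f} GB/s effective")     # ~225, vs ~960 GB/s peak per card
```

So weight traffic alone accounts for only about a quarter of one card's peak bandwidth, consistent with kernel efficiency (not raw bandwidth) being the limiter here.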
Dual 7900 XTX on Vulkan beats single RTX 3090 on CUDA (123 vs 111) for the same model at the same quantization.
PP throughput keeps scaling all the way to PP2048 (3,823 tok/s), matching single-GPU Llama 2 7B prompt speeds despite running a ~5x larger model. The MoE architecture (~3B active parameters) enables this.
These GPUs cost $800-900 each. Two of them ($1600-1800) outperform a single RTX 3090 ($1500) and approach RTX 5090 ($2000) territory while providing 48GB total VRAM vs 24GB/32GB.
CONFIGURATION NOTES
- Vulkan backend with RADV (Mesa) driver, NOT amdvlk
- Layer split mode (default, -ngl 99)
- Both GPUs detected as: AMD Radeon RX 7900 XTX (RADV NAVI31)
- warp size: 64, shared memory: 65536, int dot: 1
- KHR_coopmat: supported
- GPUs confirmed at profile_peak (1249 MHz MCLK) during all measurements
- No flash attention used for these benchmarks
- ubatch=512 (default) for prompt processing
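The configuration above can be verified with checks of roughly this shape on a Mesa system (a sketch; card indices and sysfs paths vary per machine, and the clock-pinning step needs root):

```shell
# Confirm the Vulkan driver in use is RADV (Mesa), not amdvlk, and that
# both 7900 XTX cards are enumerated
vulkaninfo --summary | grep -i -e deviceName -e driverName

# Pin clocks for stable measurements (amdgpu sysfs; one entry per card)
echo profile_peak | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level

# Check which memory clock state is active during a run (the '*' marks it)
cat /sys/class/drm/card0/device/pp_dpm_mclk
```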
RAW llama-bench OUTPUT
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 99 | tg128 | 123.08 ± 0.14 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 99 | pp1 | 118.46 ± 0.45 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 99 | pp16 | 325.08 ± 1.98 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 99 | pp64 | 833.12 ± 28.4 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 99 | pp256 | 1945.28 ± 1.04 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 99 | pp512 | 2647.13 ± 13.21 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 99 | pp1024 | 3181.31 ± 305 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 99 | pp2048 | 3822.73 ± 30.9 |