Dual 7900 XTX hitting 123 tok/s on Qwen3.5-35B (Vulkan backend)

Reddit r/LocalLLaMA / 3/28/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A 2026-03-27 benchmark ran a Qwen3.5-35B-A3B-class model on two Radeon RX 7900 XTX cards (Vulkan backend, llama.cpp on Ubuntu 24.04.4), recording a generation throughput (tg128) of about 123.08 tok/s (±0.14).
  • On the same setup, prompt processing started at 118.46 tok/s for pp1 and climbed steadily from pp16 through pp2048, reaching about 2647.13 tok/s at pp512 and about 3822.73 tok/s at pp2048.
  • Against published benchmarks, the post lines up single-7900 XTX Vulkan figures (95-105 tok/s at Q4_0) and CUDA cards (RTX 3090/5090), placing this dual 7900 XTX (Vulkan) result at around 123 tok/s for generation.
  • Configuration-wise, the distinguishing details are a roughly 19.71 GiB Q4_K_M quantization of Qwen3.5-35B-A3B under llama.cpp, with layers split across both GPUs (layer split, -ngl 99).

DUAL_7900XTX_BENCHMARK_POST.txt

Dual RX 7900 XTX — Qwen3.5-35B-A3B Inference Benchmark

Date: 2026-03-27
Hardware: 2x AMD Radeon RX 7900 XTX (48GB VRAM total, 384-bit GDDR6 per card)
CPU: Ryzen 9 5900XT (16C/32T), 64GB DDR4
OS: Ubuntu 24.04.4 LTS, Kernel 6.17.0-1012-oem
Backend: Vulkan (RADV NAVI31, Mesa), llama.cpp build b8516
Model: Huihui-Qwen3.5-35B-A3B-abliterated.Q4_K_M.gguf (19.71 GiB, 34.66B params, ~3B active)
Split: Layer split across both GPUs (-ngl 99, default split)
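The post does not include the exact llama-bench command lines. A minimal sketch that matches the reported configuration (layer split across both GPUs, -ngl 99, default ubatch=512, no flash attention) and the repetition counts in the tables below; the flags follow llama-bench's documented options, but the actual invocation is an assumption:

    # token generation only (tg128, 3 repetitions)
    ./llama-bench -m Huihui-Qwen3.5-35B-A3B-abliterated.Q4_K_M.gguf \
        -ngl 99 -p 0 -n 128 -r 3

    # prompt processing sweep (2 repetitions)
    ./llama-bench -m Huihui-Qwen3.5-35B-A3B-abliterated.Q4_K_M.gguf \
        -ngl 99 -n 0 -p 1,16,64,256,512,1024,2048 -r 2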

TOKEN GENERATION (llama-bench, 3 repetitions)

Test    tok/s
tg128   123.08 ± 0.14

PROMPT PROCESSING (llama-bench, 2 repetitions)

Test    tok/s
pp1     118.46 ± 0.45
pp16    325.08 ± 1.98
pp64    833.12 ± 28.4
pp256   1945.28 ± 1.04
pp512   2647.13 ± 13.21
pp1024  3181.31 ± 305
pp2048  3822.73 ± 30.9

COMPARISON WITH PUBLISHED BENCHMARKS (same model: Qwen3.5-35B-A3B)

Sources: [1] HuggingFace ubergarm/Qwen3.5-35B-A3B-GGUF/discussions/1 [2] llama.cpp Discussion #10879 (Vulkan performance) [3] llama.cpp Discussion #15021 (ROCm/HIP performance) [4] llama.cpp Discussion #19890 (RDNA4 R9700 vs RTX 5090) [5] InsiderLLM Qwen3.5 local guide [6] Level1Techs dual 7900 XTX thread

TOKEN GENERATION — Qwen3.5-35B-A3B (or similar MoE 30-35B A3B)

GPU                    Backend  Quant    TG tok/s  Source
Dual 7900 XTX          HIP      Q4_0     47        [1]
Single 7900 XTX        HIP      Q4_0     76-78     [1]
Single 7900 XTX        Vulkan   Q4_0     95-105    [1]
Single W7900           Vulkan   Q8_0     ~48       [6]
RTX 3090               CUDA     Q4_K_M   111       [5]
RTX 5090               CUDA     Q4_K_M   165       [5]
Radeon AI PRO R9700    Vulkan   Q4_K_XL  127       [4]
>>> Dual 7900 XTX      Vulkan   Q4_K_M   123       This post

PROMPT PROCESSING — Qwen3.5-35B-A3B

GPU                    Backend  Quant    PP512 tok/s  Source
Dual 7900 XTX          HIP      Q4_0     1,090-1,355  [1]
Single 7900 XTX        HIP      Q4_0     1,153-2,237  [1]
Single 7900 XTX        Vulkan   Q4_0     2,105-2,472  [1]
>>> Dual 7900 XTX      Vulkan   Q4_K_M   2,647        This post

CROSS-MODEL REFERENCE (Llama 2 7B Q4_0 — standard benchmark)

GPU                Backend  PP512 tok/s  TG128 tok/s  Source
Single 7900 XTX    HIP+FA   3,874        170          [3]
Single 7900 XTX    Vulkan   3,532        191          [2]
Dual 7900 XTX      HIP      330 (70B)    13.4 (70B)   [3]

vLLM COMPARISON (same hardware, same model)

We also tested vLLM 0.17.1rc1 with ROCm 7.0 on the same dual 7900 XTX setup.

Framework    Backend       Model                TG tok/s    PP tok/s  Status
vLLM 0.17.1  ROCm/HIP      Qwen3.5-35B Q4_K_M   5           N/A       Broken output
vLLM 0.17.1  ROCm/HIP      Qwen3.5-35B FP16     OOM         N/A       Does not load
vLLM 0.17.1  ROCm+FP8 MoE  Qwen3.5-35B          OOM→33.7GB  N/A       MI300X only
llama.cpp    HIP+graphs    Qwen3.5-35B Q4_K_M   86.66       ~1,345    Working
llama.cpp    Vulkan        Qwen3.5-35B Q4_K_M   123.08      3,823     Working

Notes on vLLM:

  • vLLM's GGUF MoE quantization path produced multi-language garbage output (random Chinese, Korean, Spanish tokens) at ~5 tok/s on gfx1100. The same GGUF file produces coherent output on llama.cpp.
  • vLLM's FP8 MoE quantization (--quantization fp8) reduced VRAM from 60GB to 33.7GB but only works on MI300X (CDNA3), not gfx1100 (RDNA3).
  • The AITER MoE kernel fusion library (VLLM_ROCM_USE_AITER_MOE=1) is MI300X-only and will not compile on RDNA3.
  • vLLM's Triton kernels are not optimized for RDNA3's wave32 architecture.

Bottom line: vLLM is not viable for MoE inference on RX 7900 XTX. llama.cpp Vulkan delivers 24.6x the token generation speed (123 vs 5 tok/s).
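For anyone trying to reproduce the failure modes, here is a sketch of the kind of launch that was attempted. Only the --quantization fp8 flag and the VLLM_ROCM_USE_AITER_MOE variable come from the post; the serve syntax, Hugging Face model id, tokenizer argument, and tensor-parallel size are assumptions:

    # FP8 MoE attempt (kernel is CDNA3/MI300X-only; fails on gfx1100)
    vllm serve Qwen/Qwen3.5-35B-A3B --quantization fp8 --tensor-parallel-size 2

    # GGUF attempt (loads, but produced garbage output at ~5 tok/s on RDNA3)
    vllm serve ./Huihui-Qwen3.5-35B-A3B-abliterated.Q4_K_M.gguf \
        --tokenizer Qwen/Qwen3.5-35B-A3B --tensor-parallel-size 2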

KEY OBSERVATIONS

  1. Vulkan outperforms HIP/ROCm on RDNA3 for MoE workloads.

    • TG: 123 tok/s (Vulkan) vs 47 tok/s (dual HIP) = 2.6x faster
    • This contradicts the common recommendation to use ROCm over Vulkan on AMD GPUs. For MoE models with small active parameter counts, Vulkan's GEMV path achieves higher thread utilization on the small-K expert matrices. (A HIP rebuild sketch for reproducing the comparison follows this list.)
  2. Dual 7900 XTX on Vulkan beats single RTX 3090 on CUDA (123 vs 111) for the same model at the same quantization.

  3. PP throughput keeps climbing with prompt length even at the default ubatch=512, reaching 3,823 tok/s at pp2048. That matches single-GPU 7B-model speeds despite running a roughly 5x larger model; the MoE architecture (3B active) makes this possible.

  4. These GPUs cost $800-900 each. Two of them ($1600-1800) outperform a single RTX 3090 ($1500) and approach RTX 5090 ($2000) territory while providing 48GB total VRAM vs 24GB/32GB.
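For observation 1, the HIP side of the comparison needs a ROCm build of llama.cpp. A minimal sketch, assuming the current CMake flag names (GGML_HIP, AMDGPU_TARGETS); the post does not show its build commands:

    # build llama.cpp against ROCm/HIP for gfx1100 (RDNA3)
    cmake -B build-hip -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
    cmake --build build-hip -j

    # rerun the identical benchmark for a direct backend comparison
    ./build-hip/bin/llama-bench -m Huihui-Qwen3.5-35B-A3B-abliterated.Q4_K_M.gguf \
        -ngl 99 -p 0 -n 128 -r 3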

CONFIGURATION NOTES

  • Vulkan backend with RADV (Mesa) driver, NOT amdvlk (see the driver-pinning sketch after this list)
  • Layer split mode (default, -ngl 99)
  • Both GPUs detected as: AMD Radeon RX 7900 XTX (RADV NAVI31)
    • warp size: 64, shared memory: 65536, int dot: 1
    • KHR_coopmat: supported
  • GPUs confirmed at profile_peak (1249 MHz MCLK) during all measurements
  • No flash attention used for these benchmarks
  • ubatch=512 (default) for prompt processing
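When both Mesa and amdvlk are installed, the Vulkan loader can pick the wrong ICD. A minimal sketch for pinning RADV and verifying device enumeration; the JSON path is the stock Ubuntu location and is an assumption about this system:

    # force the Mesa RADV driver (adjust the path if your distro differs)
    export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json

    # both cards should enumerate as "AMD Radeon RX 7900 XTX (RADV NAVI31)"
    vulkaninfo --summary | grep deviceName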

RAW llama-bench OUTPUT

| model                           |      size |  params | backend | ngl |   test |             t/s |
| ------------------------------- | --------: | ------: | ------- | --: | -----: | --------------: |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan  |  99 |  tg128 |   123.08 ± 0.14 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan  |  99 |    pp1 |   118.46 ± 0.45 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan  |  99 |   pp16 |   325.08 ± 1.98 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan  |  99 |   pp64 |   833.12 ± 28.4 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan  |  99 |  pp256 |  1945.28 ± 1.04 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan  |  99 |  pp512 | 2647.13 ± 13.21 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan  |  99 | pp1024 |  3181.31 ± 305  |
| qwen35moe 35B.A3B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan  |  99 | pp2048 |  3822.73 ± 30.9 |
submitted by /u/Neither-Temporary131
