Benchmarked Qwen3.5 (35B MoE, 27B Dense, 122B MoE) across Apple Silicon and AMD GPUs — ROCm vs Vulkan results were surprising, and context size matters

Reddit r/LocalLLaMA / 3/27/2026

Key Points

  • The author benchmarks Qwen3.5 variants (35B MoE with 3B active, 27B dense, and 122B MoE with 10B active) using real pharmacovigilance-style prompts across Apple Silicon and AMD GPUs to decide which machines to keep for inference.
  • Results at 8K context show ROCm underperforming Vulkan on AMD for the 35B MoE model (68.8 vs 133.0 tok/s on the R9700, 78.9 vs 123.7 tok/s on the W7900), while Mac MLX performance on the M5 Max (128.0 tok/s) is close to AMD Vulkan for this case.
  • For the 27B dense model, generation speeds are much lower overall and Vulkan still edges out ROCm on AMD (e.g., 24.4 tok/s ROCm vs 31.8 tok/s Vulkan on the W7900), indicating backend/runtime effects.
  • The tests use consistent settings (4-bit quantization, single-user single-request, /no_think, temp 0.3) and extend beyond 8K by scaling context up to 196K, emphasizing that performance conclusions depend strongly on context length.
  • The article highlights practical friction and validation details around engine/builds (mlx-lm vs llama.cpp, ROCm vs AMDVLK, and a correction to previously listed Fedora binaries), reinforcing that benchmarking methodology and software versioning materially change outcomes.


I wanted to compare inference performance across my machines to decide whether keeping a new MacBook Pro was worth it alongside my GPU server. When I went looking for practical comparisons — real models, real workloads, Apple Silicon vs AMD GPUs, ROCm vs Vulkan — I couldn't find much beyond synthetic benchmarks or single-machine reviews. So I ran my own tests.

Setup

Hardware:

  • MacBook Pro: M5 Max, 48 GB unified memory
  • Mac Studio: M1 Max, 64 GB unified memory
  • Fedora 43 server: Core Ultra 7 265K, 192 GB DDR5, W7900 (48 GB, RDNA3, PCIe Gen4 x8), R9700 (32 GB, RDNA4, PCIe Gen5 x8)¹

Engines: mlx-lm 0.31 on the Macs; llama.cpp on Fedora in two builds, a ROCm 7.2 build (914eb5f, 2026-03-25) and an AMDVLK Vulkan build (24d2ee0, 2026-03-04). Correction: the original post listed both Fedora binaries as b5065, which was wrong. The "version: 1" output doesn't show the build number; the actual commits are the recent 2026 builds listed above. The MacBook Pro llama.cpp tests in EDIT 3 used the Homebrew b8500 release.

Models: Qwen3.5-35B-A3B (MoE, 3B active), Qwen3.5-27B (dense), Qwen3.5-122B-A10B (MoE, 10B active). All 4-bit (MLX 4bit / GGUF Q4_K_M).

Benchmark: Domain-specific prompts from my actual work (pharmacovigilance data analysis — code generation, clinical reasoning, regulatory writing, structured extraction). 7 prompts at 8K context + context-scaling tests up to 196K. Single-user, single-request, /no_think, temp 0.3.


Results: Generation Speed (tok/s) — 8K Context

Qwen3.5-35B-A3B (MoE, 3B active)

| Machine | Backend | Gen tok/s |
|---|---|---|
| Fedora R9700 | AMDVLK Vulkan | 133.0 |
| MacBook Pro M5 Max | MLX | 128.0 |
| Fedora W7900 | AMDVLK Vulkan | 123.7 |
| Fedora W7900 | ROCm | 78.9 |
| Fedora R9700 | ROCm | 68.8 |
| Mac Studio M1 Max | MLX | 57.6 |

Qwen3.5-27B (Dense)

| Machine | Backend | Gen tok/s |
|---|---|---|
| Fedora W7900 | AMDVLK Vulkan | 31.8 |
| MacBook Pro M5 Max | MLX | 31.3 |
| Fedora R9700 | AMDVLK Vulkan | 30.6 |
| Fedora R9700 | ROCm | 25.2 |
| Fedora W7900 | ROCm | 24.4 |
| Mac Studio M1 Max | MLX | 15.0 |

Prompt Processing (tok/s, ~2.9K input)

| Machine | Backend | 35B-A3B PP (tok/s) | 27B PP (tok/s) |
|---|---|---|---|
| MacBook Pro M5 Max | MLX | 3,235 | 779 |
| Fedora R9700 | ROCm | 1,190 | 547 |
| Fedora W7900 | ROCm | 1,001 | 434 |
| Fedora R9700 | AMDVLK Vulkan | 1,030 | 244 |
| Fedora W7900 | AMDVLK Vulkan | 948 | 177 |
| Mac Studio M1 Max | MLX | 431 | 67 |

ROCm vs Vulkan at 8K

AMDVLK Vulkan crushed ROCm on generation for single-GPU workloads:

| GPU | Model | ROCm Gen (tok/s) | Vulkan Gen (tok/s) | Vulkan Advantage |
|---|---|---|---|---|
| R9700 | 35B-A3B | 68.8 | 133.0 | +93% |
| W7900 | 35B-A3B | 78.9 | 123.7 | +57% |
| W7900 | 27B | 24.4 | 31.8 | +30% |
| R9700 | 27B | 25.2 | 30.6 | +21% |

But ROCm had much faster prompt processing on the 27B dense model: about 2.2-2.5x in the ~2.9K-input test above, and about 4x at every context size in the W7900 scaling runs below.
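For anyone who wants to check the "Vulkan Advantage" column, here is a minimal Python sketch of the arithmetic. The numbers are copied from the 8K generation table; the helper name `advantage_pct` is mine, not something from the benchmark harness.

```python
# Generation speeds (tok/s) at 8K context, copied from the table above.
results = {
    ("R9700", "35B-A3B"): {"rocm": 68.8, "vulkan": 133.0},
    ("W7900", "35B-A3B"): {"rocm": 78.9, "vulkan": 123.7},
    ("W7900", "27B"): {"rocm": 24.4, "vulkan": 31.8},
    ("R9700", "27B"): {"rocm": 25.2, "vulkan": 30.6},
}


def advantage_pct(fast: float, slow: float) -> float:
    """Relative speedup of `fast` over `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0


for (gpu, model), r in results.items():
    pct = advantage_pct(r["vulkan"], r["rocm"])
    print(f"{gpu} {model}: Vulkan {pct:+.0f}% on generation")
```

The same helper works for the PP columns if you swap in the prompt-processing numbers.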

Context Scaling: Single GPU (W7900, 32K allocation)

35B-A3B (MoE)

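| Prompt Tokens | ROCm PP (tok/s) | Vulkan PP (tok/s) | ROCm Gen (tok/s) | Vulkan Gen (tok/s) |
|---|---|---|---|---|
| 1,137 | 1,537 | 1,534 | 84.2 | 132.0 |
| 4,415 | 1,524 | 1,435 | 83.3 | 129.3 |
| 8,824 | 1,452 | 1,332 | 81.6 | 119.2 |
| 17,635 | 1,297 | 1,121 | 79.2 | 116.6 |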

27B (Dense)

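| Prompt Tokens | ROCm PP (tok/s) | Vulkan PP (tok/s) | ROCm Gen (tok/s) | Vulkan Gen (tok/s) |
|---|---|---|---|---|
| 1,137 | 704 | 171 | 26.2 | 36.1 |
| 4,415 | 720 | 167 | 25.6 | 34.9 |
| 8,824 | 684 | 164 | 25.1 | 33.8 |
| 17,635 | 611 | 153 | 24.5 | 30.6 |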

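Pattern: on the MoE, ROCm's PP advantage grows with context (from parity at ~1.1K tokens to about +16% at ~17.6K); on the dense model it stays at roughly 4x throughout. Vulkan's gen advantage shrinks with context but stays positive through the largest prompt tested (~17.6K tokens) on a single GPU.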
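The pattern is easier to see as ratios per row. This is a small sketch using only the W7900 numbers from the two tables above; the `ratios` helper is illustrative and not part of the original scripts.

```python
# W7900 single-GPU context-scaling rows, copied from the tables above:
# (prompt_tokens, rocm_pp, vulkan_pp, rocm_gen, vulkan_gen)
moe_35b = [
    (1137, 1537, 1534, 84.2, 132.0),
    (4415, 1524, 1435, 83.3, 129.3),
    (8824, 1452, 1332, 81.6, 119.2),
    (17635, 1297, 1121, 79.2, 116.6),
]
dense_27b = [
    (1137, 704, 171, 26.2, 36.1),
    (4415, 720, 167, 25.6, 34.9),
    (8824, 684, 164, 25.1, 33.8),
    (17635, 611, 153, 24.5, 30.6),
]


def ratios(rows):
    """Return (tokens, ROCm/Vulkan PP ratio, Vulkan/ROCm gen ratio) per row."""
    return [(t, rp / vp, vg / rg) for t, rp, vp, rg, vg in rows]


for name, rows in (("35B-A3B", moe_35b), ("27B", dense_27b)):
    for tokens, pp_ratio, gen_ratio in ratios(rows):
        print(f"{name} @ {tokens:>6} tok: ROCm PP x{pp_ratio:.2f}, Vulkan gen x{gen_ratio:.2f}")
```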


Key Takeaways

  1. M5 Max is fast. 128 tok/s on the MoE, 3,235 PP tok/s. Unified memory with no PCIe bottleneck is a real advantage. Worth keeping.

  2. Don't assume ROCm > Vulkan. For single-GPU inference on the W7900 and R9700, AMDVLK Vulkan was 21-93% faster on generation. Test both.

  3. But ROCm dominates PP on dense models: about 2.2-2.5x faster in the ~2.9K-input test and about 4x in the W7900 context-scaling runs. If your workload is long-context input (RAG, document analysis), ROCm's time-to-first-token advantage is massive.

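  4. PCIe bandwidth may matter. On Vulkan, the R9700 on Gen5 x8 beat the W7900 on Gen4 x8 for MoE generation (133.0 vs 123.7 tok/s) despite less VRAM and fewer CUs. On ROCm the order flips (68.8 vs 78.9 tok/s), and the two cards are different architectures, so this isn't a clean PCIe test.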

  5. MoE is the sweet spot for prosumer hardware. 35B-A3B at 4-bit: 123-133 tok/s on single AMD GPUs with Vulkan. The 27B dense at 24-32 tok/s is noticeably slower for similar benchmark quality.

Caveats

  • Domain-specific prompts — pharmacovigilance workloads. Your mileage will vary with other tasks.
  • PCIe slots are not equivalent — R9700 has 2x the bandwidth of W7900 (Gen5 x8 vs Gen4 x8). This confounds the GPU-vs-GPU comparison.
  • AMDVLK, not RADV — recent Mesa 25.3+ has improved RADV significantly for LLM inference. May give different results.
  • Quantization differs between MLX 4-bit and GGUF Q4_K_M.
  • Single-user only. No concurrent request testing.
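On the PCIe caveat above, the "2x the bandwidth" figure follows from the link parameters. Here is a rough Python sketch of the theoretical per-direction numbers, assuming the standard per-lane rates (16 GT/s for Gen4, 32 GT/s for Gen5) and 128b/130b encoding. Real-world throughput is lower because of protocol overhead, and nothing here was measured on these machines.

```python
def pcie_gb_per_s(gt_per_s: float, lanes: int) -> float:
    """Theoretical one-direction PCIe bandwidth in GB/s.

    Assumes 128b/130b line encoding (used by PCIe Gen3 and later) and
    ignores packet/protocol overhead.
    """
    return gt_per_s * lanes * (128 / 130) / 8


w7900_link = pcie_gb_per_s(16, 8)  # Gen4 x8
r9700_link = pcie_gb_per_s(32, 8)  # Gen5 x8
w6800_link = pcie_gb_per_s(16, 4)  # Gen4 x4 chipset slot (see footnote)

print(f"W7900 Gen4 x8: {w7900_link:.2f} GB/s")
print(f"R9700 Gen5 x8: {r9700_link:.2f} GB/s")
print(f"W6800 Gen4 x4: {w6800_link:.2f} GB/s")
print(f"R9700 / W7900: {r9700_link / w7900_link:.1f}x")
```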

¹ Also tested a W6800 (32GB, RDNA2, Gen4 x4 chipset slot) — couldn't run ROCm at all with Qwen3.5 (Gated Delta Net crash), and Vulkan performance was heavily bottlenecked by the x4 chipset link. Results omitted from main tables for clarity: 38.4 tok/s gen (35B-A3B), 18.0 tok/s gen (27B).


The benchmark scripts, orchestration, and this write-up were produced with the help of Claude Code (Claude Opus 4.6). I directed the testing strategy and hardware decisions; Claude wrote the benchmark harness, managed the model downloads, ran the tests across all machines via SSH, and drafted the post.


EDIT: Ran the full suite on the 122B model (dual GPU W7900+R9700, --split-mode layer). The pattern reverses — ROCm wins everything:

| Metric | ROCm | Vulkan | Winner |
|---|---|---|---|
| Gen tok/s (8K) | 45.7 | 40.5 | ROCm +13% |
| PP tok/s (2.9K) | 735 | 588 | ROCm +25% |

Context scaling (8K to 16K) showed ROCm winning by +10-23% across the board. The crossover:

| Model | Active Params | GPUs | Gen Winner | PP Winner |
|---|---|---|---|---|
| 35B-A3B (MoE) | 3B | Single | Vulkan +57-93% | Roughly tied |
| 27B (Dense) | 27B | Single | Vulkan +21-30% | ROCm 2.2-4.3x |
| 122B-A10B (MoE) | 10B | Dual | ROCm +13% | ROCm +15-25% |

TL;DR: Single GPU, small models → Vulkan. Multi-GPU, large models → ROCm.


EDIT 2: By request, tested large context with the 35B-A3B — single GPU (W7900, 131K allocation) and dual GPU (W7900+R9700, 262K allocation).

Single GPU (W7900) — up to 100K context

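| Context (tokens) | ROCm PP (tok/s) | Vulkan PP (tok/s) | ROCm Gen (tok/s) | Vulkan Gen (tok/s) |
|---|---|---|---|---|
| 8,824 | 1,525 | 1,422 | 81.7 | 124.5 |
| 17,635 | 1,315 | 1,120 | 79.4 | 116.8 |
| 35,577 | 1,096 | 846 | 75.3 | 100.0 |
| 71,603 | 808 | 561 | 67.7 | 85.4 |
| 109,510 | 602 | 380 | 61.2 | 72.3 |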

On a single card, Vulkan wins generation at every context size tested (up to ~110K prompt tokens), but the gap shrinks from +52% at 8K to +18% at the largest size. ROCm's PP advantage grows from +7% to +58% over the same range.

Dual GPU (W7900+R9700) — up to 196K context

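| Context (tokens) | ROCm PP (tok/s) | Vulkan PP (tok/s) | ROCm Gen (tok/s) | Vulkan Gen (tok/s) |
|---|---|---|---|---|
| 8,824 | 2,148 | 2,072 | 74.8 | 82.1 |
| 35,577 | 1,679 | 1,380 | 69.2 | 70.3 |
| 71,603 | 1,447 | 782 | 63.2 | 59.4 |
| 109,510 | 854 | 563 | 58.0 | 48.3 |
| 143,695 | 665 | 432 | 53.8 | 42.6 |
| 215,917 | 523 | 301 | 46.7 | 34.3 |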

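With dual GPU, there's a generation crossover somewhere between the 35,577-token and 71,603-token runs: Vulkan is slightly faster at the smaller sizes, ROCm is ahead from the 65K run upward, and the gap widens from there. By 196K, ROCm is 36% faster on generation and 74% faster on PP. (The "65K" and "196K" labels are nominal context sizes; the table lists the actual prompt token counts, so the 196K run is 215,917 tokens.)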
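To put a rough number on the crossover, one option is to linearly interpolate the ROCm-minus-Vulkan generation gap between the two measurements that bracket it. This is only a sketch: linear interpolation between two points is my assumption, no runs were made between those sizes, and `estimate_crossover` is an illustrative helper.

```python
# Dual-GPU generation speeds (tok/s), copied from the table above:
# (prompt_tokens, rocm_gen, vulkan_gen)
rows = [
    (8824, 74.8, 82.1),
    (35577, 69.2, 70.3),
    (71603, 63.2, 59.4),
    (109510, 58.0, 48.3),
    (143695, 53.8, 42.6),
    (215917, 46.7, 34.3),
]


def estimate_crossover(rows):
    """Linearly interpolate where (ROCm - Vulkan) gen speed crosses zero.

    Returns the estimated prompt size in tokens, or None if ROCm never
    overtakes Vulkan in the given rows.
    """
    for (t0, r0, v0), (t1, r1, v1) in zip(rows, rows[1:]):
        d0, d1 = r0 - v0, r1 - v1
        if d0 < 0 <= d1:
            return t0 + (0 - d0) / (d1 - d0) * (t1 - t0)
    return None


crossover = estimate_crossover(rows)
print(f"Estimated crossover: ~{crossover / 1000:.0f}K prompt tokens")
```

With these numbers the estimate lands in the mid-40K range, well inside the bracket but not pinned down. A few extra runs between 36K and 72K would settle it.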

The interactivity cliff

Both ROCm and Vulkan suffer steep performance degradation at very large context, and it's the prompt processing drop that kills interactivity. On dual GPU Vulkan, PP falls from 2,072 tok/s at 8K to 301 tok/s at 196K, an 85% drop. That means the 196K prompt (215,917 tokens) takes ~12 minutes just for time-to-first-token on Vulkan, vs ~7 minutes on ROCm. Even at 65K, you're waiting 50-90 seconds for the first token. Generation speed also degrades (82 → 34 tok/s on Vulkan, 75 → 47 on ROCm), but it's the PP wall-clock that makes large context feel sluggish in practice. If you're doing long-context RAG or document analysis interactively, plan for this: the 262K native context is technically supported, but the experience at 128K+ is very different from 8K.
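The time-to-first-token figures above are just prompt tokens divided by PP throughput. A minimal sketch of that arithmetic, using the dual-GPU table values and assuming the reported PP tok/s is the average over the whole prompt (`ttft_seconds` is an illustrative name):

```python
def ttft_seconds(prompt_tokens: int, pp_tok_per_s: float) -> float:
    """Approximate time-to-first-token: prompt size / prompt-processing rate."""
    return prompt_tokens / pp_tok_per_s


# (prompt_tokens, rocm_pp, vulkan_pp) from the dual-GPU table above.
dual_gpu_pp = [
    (8824, 2148, 2072),
    (71603, 1447, 782),
    (215917, 523, 301),
]

for tokens, rocm_pp, vulkan_pp in dual_gpu_pp:
    rocm_s = ttft_seconds(tokens, rocm_pp)
    vulkan_s = ttft_seconds(tokens, vulkan_pp)
    print(f"{tokens:>7} tokens: ROCm {rocm_s:6.0f} s, Vulkan {vulkan_s:6.0f} s")
```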

ROCm stability note

ROCm crashed with a memory access fault on the R9700 (Memory access fault by GPU node-1 on address 0x7fedadca1000. Reason: Page not present or supervisor privilege.) when using the default multi-slot configuration at 65K+ context. The crash occurred during KV cache checkpoint reuse between requests. Limiting to -np 1 (single parallel slot) resolved it. Vulkan had zero stability issues at all context sizes up to 196K.
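For reference, here's a sketch of what the single-slot workaround could look like when launching the server from Python. The exact command line used in these tests isn't shown in this post, so treat everything here as an assumption: it presumes llama.cpp's `llama-server` binary, the model filename is hypothetical, and 262144 is my reading of the "262K allocation". Only `-np 1` and `--split-mode layer` are taken from the post.

```python
import shlex
from typing import List, Optional


def build_server_cmd(model_path: str, ctx_size: int, parallel_slots: int = 1,
                     split_mode: Optional[str] = None) -> List[str]:
    """Build a llama-server command line with an explicit slot count.

    -m  : model file
    -c  : context size in tokens
    -np : number of parallel slots (1 = the single-slot workaround)
    """
    cmd = ["llama-server", "-m", model_path, "-c", str(ctx_size),
           "-np", str(parallel_slots)]
    if split_mode is not None:
        cmd += ["--split-mode", split_mode]
    return cmd


# Hypothetical model filename; adjust to wherever your GGUF lives.
cmd = build_server_cmd("Qwen3.5-35B-A3B-Q4_K_M.gguf", 262144,
                       parallel_slots=1, split_mode="layer")
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```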

So the commenter who said ROCm doesn't do well at large context was right on generation speed (Vulkan generates faster below roughly 65K) and on stability (the multi-slot crashes), though ROCm's prompt processing was ahead at every size tested. Above 65K, ROCm recovers and actually leads on generation, if you work around the stability issue.


EDIT 3: Fair point that the original comparison used MLX 4-bit on the Macs and GGUF Q4_K_M on Fedora. These are different quantization formats with different file sizes, so it's not apples-to-apples. I installed llama.cpp b8500 (Metal) on the MacBook Pro and ran the exact same GGUF files (copied from the Fedora machine).

All llama.cpp GGUF Q4_K_M — Same Files Everywhere

Qwen3.5-35B-A3B (MoE)

| Machine | Backend | Gen tok/s | PP tok/s (2.9K) |
|---|---|---|---|
| Fedora R9700 | AMDVLK Vulkan | 133.0 | 1,030 |
| Fedora W7900 | AMDVLK Vulkan | 123.7 | 948 |
| MacBook Pro M5 Max | Metal (b8500) | 89.4 | 783 |
| Fedora W7900 | ROCm | 78.9 | 1,001 |
| Fedora R9700 | ROCm | 68.8 | 1,190 |

Qwen3.5-27B (Dense)

| Machine | Backend | Gen tok/s | PP tok/s (2.9K) |
|---|---|---|---|
| Fedora W7900 | AMDVLK Vulkan | 31.8 | 177 |
| Fedora R9700 | AMDVLK Vulkan | 30.6 | 244 |
| Fedora R9700 | ROCm | 25.2 | 547 |
| Fedora W7900 | ROCm | 24.4 | 434 |
| MacBook Pro M5 Max | Metal (b8500) | 23.7 | 171 |

With the same GGUF files, the Fedora GPUs on Vulkan beat the M5 Max on generation for both models. The MacBook Pro's strong showing in the original post was partly due to MLX's optimization advantage over llama.cpp on Apple Silicon, not just the hardware.

MLX vs llama.cpp on the MacBook Pro (separate comparison)

These use different quantization formats and file sizes, so this compares engine plus quantization format, not the engines alone:

| Model | MLX 4-bit Gen (tok/s) | llama.cpp Q4_K_M Gen (tok/s) | MLX Advantage |
|---|---|---|---|
| 35B-A3B | 128.0 | 89.4 | +43% |
| 27B | 31.3 | 23.7 | +32% |

MLX is significantly faster on Apple Silicon, but the MLX 4-bit models are also smaller than the Q4_K_M GGUFs — the speed difference can't be attributed purely to the inference engine. A proper comparison would need same-size quantizations or a quality metric like KLD drift between the two formats.
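For context on the KLD idea: the usual approach is to compare the next-token probability distribution of a reference model against the quantized one and average the Kullback-Leibler divergence over many positions. Here is a toy Python sketch of the per-position calculation. The logits are made-up numbers for illustration, not measurements from either format, and a real comparison would need both engines to dump logits over the same text.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


# Made-up logits over a 4-token vocabulary at a single position.
reference = softmax([2.0, 1.0, 0.1, -1.0])   # e.g. unquantized model
quantized = softmax([1.9, 1.1, 0.1, -0.9])   # e.g. a 4-bit variant

print(f"KL(reference || quantized) = {kl_divergence(reference, quantized):.6f} nats")
```

A lower average KLD means the quantized model's predictions stay closer to the reference, which is what you'd want to hold roughly equal before comparing speeds.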


EDIT 4: A commenter correctly pointed out that the W6800 ROCm crash was likely a build issue, not an architecture limitation — they run Qwen3.5 on even older GPUs (Radeon Pro VII, gfx906) with ROCm. Checked the build config and confirmed: the ROCm binary was compiled with AMDGPU_TARGETS=gfx1100;gfx1201 only — gfx1030 was never included. Rebuilt with gfx1030;gfx1100;gfx1201 and the W6800 now works perfectly with ROCm.
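A quick sanity check like the one below would have caught this before benchmarking. It's a sketch, not part of the actual harness: the helper name is mine, and the GPU-to-target mapping is my reading of the build targets mentioned above (W6800 = gfx1030, W7900 = gfx1100, R9700 = gfx1201).

```python
def build_covers_gpu(amdgpu_targets: str, gpu_arch: str) -> bool:
    """Check whether a semicolon-separated AMDGPU_TARGETS list includes gpu_arch."""
    targets = {t.strip() for t in amdgpu_targets.split(";") if t.strip()}
    return gpu_arch in targets


# Assumed mapping of the cards in this post to their gfx targets.
gpus = {"W6800": "gfx1030", "W7900": "gfx1100", "R9700": "gfx1201"}

original_build = "gfx1100;gfx1201"
fixed_build = "gfx1030;gfx1100;gfx1201"

for name, arch in gpus.items():
    print(f"{name} ({arch}): original build={build_covers_gpu(original_build, arch)}, "
          f"rebuilt={build_covers_gpu(fixed_build, arch)}")
```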

W6800 ROCm vs Vulkan (corrected)

Qwen3.5-35B-A3B (MoE)

| Backend | Gen tok/s | PP tok/s (2.9K) |
|---|---|---|
| ROCm (gfx1030 build) | 58.3 | 1,359 |
| AMDVLK Vulkan | 38.4 | 534 |
| ROCm advantage | +52% | +155% |

Qwen3.5-27B (Dense)

| Backend | Gen tok/s | PP tok/s (2.9K) |
|---|---|---|
| ROCm | 19.3 | 316 |
| AMDVLK Vulkan | 18.0 | 143 |
| ROCm advantage | +7% | +121% |

On the W6800, ROCm is faster than Vulkan on both generation and PP — the opposite of the W7900/R9700 results. This is interesting: the RDNA 2 card benefits from ROCm while the newer RDNA 3/4 cards benefit from Vulkan. The W6800 is also on a PCIe Gen4 x4 chipset slot, which mainly bottlenecks PP rather than generation (the model fits entirely in VRAM so generation doesn't need PCIe bandwidth).

The original claim that "RDNA 2 can't run ROCm with Gated Delta Net models" was wrong — it was a build configuration error. Thanks to the commenter who flagged this.

submitted by /u/neuromacmd