Testing ROCm 7 using TheRock nightly tarballs against Vulkan on Mi50.

System Setup

Models Tested

All models were run with llama-bench using -fa 1 and the default f16 cache types.

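For anyone wanting to run something similar, here is a minimal sketch of how a llama-bench command line along these lines might be assembled. It only builds the argument list and does not run anything. The model path, prompt sizes and depths in the usage note are hypothetical placeholders, not the values used for this post. The -fa 1 setting and the 256-token generation length come from the post, and -ncmoe / -mmp are the options the summary lists for the 122B model.

```python
import shlex


def build_llama_bench_cmd(model_path, prompt_sizes=(512,), depths=(0,),
                          n_gen=256, flash_attn=True,
                          n_cpu_moe=None, mmap=None):
    """Return an argv list for a single llama-bench invocation.

    llama-bench accepts comma-separated lists for -p and -d, so several
    prompt sizes and depths can be swept in one run.
    """
    cmd = [
        "llama-bench",
        "-m", model_path,                               # hypothetical GGUF path
        "-fa", "1" if flash_attn else "0",              # flash attention, as in the post
        "-p", ",".join(str(p) for p in prompt_sizes),   # prompt processing sizes
        "-n", str(n_gen),                               # tokens generated per test
        "-d", ",".join(str(d) for d in depths),         # context depth before each test
        "-o", "json",                                   # machine-readable output
    ]
    if n_cpu_moe is not None:
        # Keep the MoE expert weights of the first N layers on the CPU.
        cmd += ["-ncmoe", str(n_cpu_moe)]
    if mmap is not None:
        cmd += ["-mmp", "1" if mmap else "0"]
    return cmd


def as_shell(cmd):
    """Render the argv list as a copy-pasteable shell line."""
    return shlex.join(cmd)
```

For example, `as_shell(build_llama_bench_cmd("model.gguf", prompt_sizes=(512, 4096), depths=(0, 8192)))` gives a single line that can be pasted into a shell for whichever backend build (ROCm or Vulkan) is first on the PATH. The cache types are left at their defaults, which matches the f16 setup described above.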
Prompt Processing

Vulkan at short context (under 16k) is reliably faster than ROCm, but only on dense models (Qwen 3.5 9B and 27B). At long context on dense models, or at basically any context length on MOE models, ROCm is consistently faster.

Token Generation

All generations were standardized at 256 tokens at varying depths. The pattern from prompt processing repeats here: Vulkan is faster with dense models. Generation speed doesn't decay with depth as much as prompt processing speed does. If you're using MOEs, and especially split GPU/CPU inference, ROCm is faster.

Conclusions

Limitations

TheRock's ROCm nightly builds are not a stable release. You will probably encounter weird behavior.

I'm not sure whether it's a ROCm bug or a llama.cpp bug, but I currently cannot run llama-server on ROCm with Qwen 3.5 27B Q8: it keeps trying to allocate the 8192MB prompt cache in VRAM instead of system RAM, causing an OOM error (-cram 0 isn't disabling it and -cram 1024 doesn't lower the size; I don't know why). It runs with Vulkan though.

I also noticed what seemed to be a memory leak with a different ROCm nightly from a few weeks ago and an earlier llama.cpp version, which was resolved by switching back to Vulkan. Using OpenCode with 100k+ context, GPU memory usage slowly crept up from 90% to an OOM with Qwen Next Coder and a ROCm nightly build. I have not tried to replicate it since switching back to ROCm and the newer nightly version though.

I'm an ex-dev turned product manager just learning and doing this as a hobby though, so it's fine :)

Full data set: https://pastebin.com/4pPuGAcV

Llama.cpp Mi50 ROCm 7 vs Vulkan Benchmarks
Reddit r/LocalLLaMA / 2026/3/23
Key points
- Comparative benchmark of ROCm 7.13 nightly versus Vulkan 1.4.341.1 on an Mi50 32GB system (EPYC 7532, Proxmox virtualization, Ubuntu Server 24.04, kernel 6.8), using llama.cpp build 8467 and llama-bench.
- Models tested include Qwen 3.5 9B/27B/122B and Nemotron Cascade 2; the 122B model is offloaded to the CPU with -ncmoe 28 (and -mmp 0).
- In prompt processing, Vulkan is faster for short-context runs on dense models, while ROCm has the edge at long context and on MOE models, a tendency that is most pronounced for split GPU/CPU inference.
- The same pattern appears for 256-token generation, with ROCm again ahead in MOE scenarios. Runs used -fa 1 and the default f16 cache.
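To check the "which backend wins where" pattern against your own runs, a small comparison helper could look like the sketch below. It assumes each backend's results were saved with `llama-bench -o json`, and that every row carries the fields `model_filename`, `n_prompt`, `n_gen`, `n_depth` and `avg_ts` (average tokens per second). Those field names are an assumption about the JSON output and may differ between llama.cpp builds, so adjust them if needed. The function names are made up for illustration.

```python
import json


def load_results(path):
    """Load a list of result rows saved from `llama-bench -o json`."""
    with open(path) as f:
        return json.load(f)


def label_for(row):
    """Build a label such as 'pp512' or 'tg256 @ d4096' from one row."""
    if row.get("n_prompt", 0) > 0:
        kind = f"pp{row['n_prompt']}"
    else:
        kind = f"tg{row['n_gen']}"
    depth = row.get("n_depth", 0)  # assumed field; treated as 0 if missing
    return f"{kind} @ d{depth}" if depth else kind


def compare_backends(rocm_rows, vulkan_rows):
    """Pair up matching tests and report ROCm throughput relative to Vulkan."""
    def index(rows):
        return {(r["model_filename"], label_for(r)): r["avg_ts"] for r in rows}

    rocm, vulkan = index(rocm_rows), index(vulkan_rows)
    report = []
    # Only tests present in both result sets can be compared.
    for model, test in sorted(rocm.keys() & vulkan.keys()):
        ratio = rocm[(model, test)] / vulkan[(model, test)]
        if ratio > 1:
            faster = "ROCm"
        elif ratio < 1:
            faster = "Vulkan"
        else:
            faster = "tie"
        report.append({
            "model": model,
            "test": test,
            "rocm_over_vulkan": ratio,
            "faster": faster,
        })
    return report
```

Typical use would be `compare_backends(load_results("rocm.json"), load_results("vulkan.json"))`, where both file names are placeholders. A ratio above 1 means ROCm was faster for that model and test, and below 1 means Vulkan was.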




