Built a custom llama.cpp backend that dispatches GEMM ops directly to the XDNA2 NPU on Ryzen AI MAX 385 (Strix Halo), so decode runs with no iGPU involvement and no shared-memory contention.
Model: Meta-Llama-3.1-8B-Instruct Q4_K_M
Hardware: Ryzen AI MAX 385, CachyOS 6.19, amdxdna driver, XRT 2.21.75
Results
| Backend | Prefill (t/s pp512) | Decode (t/s tg64) | Avg Power | J/tok |
|---|---|---|---|---|
| Vulkan prefill + NPU decode | 930 | 43.7 | 41.5 W | 0.947 |
| Vulkan only | 833 | 41.6 | 52.2 W | 1.3 |
| CPU only | 4.6 | 3.76 | — | — |
The NPU decode path saves ~10 W vs Vulkan-only while matching (slightly beating) decode throughput, and it leaves the iGPU free for other work.
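The J/tok column is just average power divided by decode throughput. A quick recomputation from the table's rounded figures (it may differ in the third decimal from the table, which was presumably derived from unrounded measurements):

```python
def joules_per_token(avg_power_w: float, decode_tps: float) -> float:
    """Energy per decoded token: average power draw / tokens per second."""
    return avg_power_w / decode_tps

# Figures from the results table above.
npu_jpt = joules_per_token(41.5, 43.7)     # ~0.95 J/tok
vulkan_jpt = joules_per_token(52.2, 41.6)  # ~1.25 J/tok
print(f"NPU decode: {npu_jpt:.3f} J/tok, Vulkan-only: {vulkan_jpt:.3f} J/tok")
```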
Stack
- Kernels: mlir-aie xclbins (Xilinx/mlir-aie, Apache 2.0)
- Runtime dispatch: XRT 2.21.75
- Base: fork of ggml-org/llama.cpp (MIT)
- 4 xclbin slots covering different K-dimension tiles, MIN_N/MAX_N routing to pick the right kernel at runtime
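The slot routing amounts to a small dispatch table. This is an illustrative sketch only: the K tiles use Llama-3.1-8B's hidden/FFN dims, but the N bounds and xclbin names are hypothetical, not the fork's actual values.

```python
# Hypothetical slot table: each xclbin is compiled for one K-dimension
# tile and serves a MIN_N..MAX_N range of batch sizes.
SLOTS = [
    # (k_tile, min_n, max_n, xclbin)
    (4096,  1, 1, "gemv_k4096_n1.xclbin"),
    (4096,  2, 8, "gemm_k4096_n8.xclbin"),
    (14336, 1, 1, "gemv_k14336_n1.xclbin"),
    (14336, 2, 8, "gemm_k14336_n8.xclbin"),
]

def pick_xclbin(k: int, n: int) -> str:
    """Route a GEMM of shape (M, K) x (K, N) to a kernel slot at runtime."""
    for k_tile, min_n, max_n, name in SLOTS:
        if k % k_tile == 0 and min_n <= n <= max_n:
            return name
    raise ValueError(f"no kernel slot covers K={k}, N={n}")
```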
Ceiling investigation
Tried everything to push past 43.7 t/s decode:
- Batch sweep N=1..64: flat. No improvement.
- Int4 double-quant: killed SNR (44.8 → 19.7 dB). Dead end.
- Cascade offload: ruled out by AMD docs.
- Speculative decoding with Llama-3.2-1B draft (44% accept rate, 212 t/s draft): zero effective gain.
Spec decoding not helping is the interesting one: normally a 44% accept rate would buy you something. That it didn't here confirms the bottleneck is LPDDR5X bandwidth, not compute. The NPU is already at the memory wall, and 43.7 t/s is the ceiling for this model on this hardware.
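A back-of-envelope check of the memory-wall claim. Each decoded token streams the full weight set once, so effective bandwidth is roughly model bytes times tokens/s. Both inputs are assumptions, not measurements: Q4_K_M 8B weights at ~4.9 GB, and Strix Halo's 256-bit LPDDR5X-8000 bus for the theoretical peak.

```python
# Effective bandwidth consumed by decode vs theoretical peak.
model_gb = 4.9                             # assumed Q4_K_M 8B file size
decode_tps = 43.7                          # measured decode throughput
effective_bw = model_gb * decode_tps       # ~214 GB/s streamed
peak_bw = 8000e6 * 256 / 8 / 1e9           # 256 GB/s (LPDDR5X-8000, 256-bit)
print(f"{effective_bw:.0f} GB/s of {peak_bw:.0f} GB/s peak "
      f"({effective_bw / peak_bw:.0%})")
```

~84% of theoretical peak is about as much as real workloads ever sustain on this kind of memory subsystem, which is consistent with decode being bandwidth-bound.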
Links
- GitHub: https://github.com/BrandedTamarasu-glitch/OllamaAMDNPU
- Changelog: https://brandedtamarasu-glitch.github.io/OllamaAMDNPU/xdna-npu/
Built with Claude Sonnet 4.6 / Claude Code — disclosed because it's relevant to reproducibility.
Anyone running Strix Halo or Phoenix with the amdxdna driver — what decode throughput are you seeing on comparable quants? Curious whether other XDNA2 configurations hit the same wall or if there's headroom I haven't found.