Offloading LLM matrix multiplication to the AMD XDNA2 NPU on Ryzen AI MAX 385 : 43.7 t/s decode at 0.947 J/tok

Reddit r/LocalLLaMA / 3/27/2026

📰 NewsDeveloper Stack & InfrastructureSignals & Early TrendsTools & Practical Usage

Key Points

  • The author built a custom llama.cpp backend that dispatches GEMM (matrix multiplication) decode work directly to the AMD XDNA2 NPU on the Ryzen AI MAX 385 (Strix Halo), avoiding iGPU use and reducing contention.

Built a custom llama.cpp backend that dispatches GEMM ops directly to the XDNA2 NPU on Ryzen AI MAX 385 (Strix Halo). No iGPU and no shared memory contention.

Model: Meta-Llama-3.1-8B-Instruct Q4_K_M

Hardware: Ryzen AI MAX 385, CachyOS 6.19, amdxdna driver, XRT 2.21.75 2.21.75

Results

Backend Prefill (t/s pp512) Decode (t/s tg64) Avg Power J/tok
Vulkan prefill + NPU decode 930 43.7 41.5 W 0.947
Vulkan only 833 41.6 52.2 W 1.3
CPU only 4.6 3.76

The NPU decode path saves ~10W vs Vulkan-only while matching (slightly beating) decode throughput, because the iGPU is free for other work.

Stack

  • Kernels: mlir-aie xclbins (Xilinx/mlir-aie, Apache 2.0)
  • Runtime dispatch: XRT 2.21.75
  • Base: fork of ggml-org/llama.cpp (MIT)
  • 4 xclbin slots covering different K-dimension tiles, MIN_N/MAX_N routing to pick the right kernel at runtime

Ceiling investigation

Tried everything to push past 43.7 t/s decode:

  • Batch sweep N=1..64: flat. No improvement.
  • Int4 double-quant: killed SNR (44.8 → 19.7 dB). Dead end.
  • Cascade offload: ruled out by AMD docs.
  • Speculative decoding with Llama-3.2-1B draft (44% accept rate, 212 t/s draft): zero effective gain.

Spec decoding not helping is the interesting one, normally a 44% accept rate would buy you something. It didn't in this scenario, which confirms the bottleneck is LPDDR5's bandwidth, not compute. The NPU is already hitting the memory wall. 43.7 t/s is the ceiling for this model on this hardware.

Links

Built with Claude Sonnet 4.6 / Claude Code — disclosed because it's relevant to reproducibility.

Anyone running Strix Halo or Phoenix with the amdxdna driver — what decode throughput are you seeing on comparable quants? Curious whether other XDNA2 configurations hit the same wall or if there's headroom I haven't found.

submitted by /u/brandedtamarasu
[link] [comments]