kernel-anvil: 2x decode speedup on AMD by auto-tuning llama.cpp kernels per model shape

Reddit r/LocalLLaMA / 3/30/2026


Key Points

  • kernel-anvil is a profiling and configuration tool for GGUF models on AMD GPUs that identifies unique GEMV/MMVQ layer shapes and auto-tunes llama.cpp kernel parameters per shape at runtime (no recompilation required).
  • The approach fixes a performance limitation in llama.cpp where MMVQ kernels use identical thread/block settings across layers regardless of their shape, which wastes throughput on RDNA3.
  • On an RX 7900 XTX, kernel-anvil reports up to a 2.25x decode speedup for Qwen3.5-27B Q4_K_M (12 tok/s to 27 tok/s) and 1.2x–2.1x improvements across shapes for Qwen3-8B Q4_K_M.
  • The tool generates a JSON config file that a small (~50-line) patch to llama.cpp’s mmvq.cu (branch: smithy-shape-configs) can read at startup to apply optimal per-shape settings.
  • The author positions this as the first AMD-focused kernel optimization tool for llama.cpp, with CUDA/Metal support planned and potential upstreaming of the patch after more testing.

Built a tool that profiles your GGUF model's layer shapes on your AMD GPU and generates optimal kernel configs that llama.cpp loads at runtime. No recompilation needed.

The problem: llama.cpp's MMVQ kernels use the same thread/block configuration for every layer regardless of shape. A 1024-row GQA projection gets the same settings as a 17408-row FFN layer. This leaves significant performance on the table, especially on RDNA3.
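To make the mismatch concrete, here is a toy calculation (not llama.cpp's actual scheduler; the CU count is the RX 7900 XTX's, but "blocks per CU" is only a rough occupancy proxy) of how many thread blocks a one-size-fits-all `rows_per_block` launches for the two shapes above:

```python
# Toy illustration of why one launch config can't fit every layer shape.
# Row counts are from the post's example; "blocks/CU" is a crude
# occupancy proxy, not llama.cpp's real scheduling model.

def blocks_launched(n_rows: int, rows_per_block: int) -> int:
    """Thread blocks needed to cover n_rows, one group of rows per block."""
    return -(-n_rows // rows_per_block)  # ceiling division

N_CU = 96  # compute units on an RX 7900 XTX

for n_rows in (1024, 17408):           # GQA projection vs. FFN layer
    for rpb in (1, 2, 4, 8):
        b = blocks_launched(n_rows, rpb)
        print(f"rows={n_rows:5d}  rows_per_block={rpb}  "
              f"blocks={b:5d}  blocks/CU={b / N_CU:5.1f}")
```

With `rows_per_block=8` the 1024-row projection launches only 128 blocks (barely more than one per CU) while the FFN shape launches 2176, so a setting that keeps the big shape busy can starve the small one; that gap is what per-shape tuning recovers.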

The fix: kernel-anvil reads your GGUF, identifies the unique GEMV shapes, profiles each one on your actual GPU, and writes a JSON config file. A small patch to llama.cpp's mmvq.cu reads this config at startup and applies per-shape optimal nwarps and rows_per_block.
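The post doesn't show the tool's internals, but the profile-and-sweep loop it describes can be sketched like this. Everything here is hypothetical: `benchmark` is a deliberately fake stand-in for a real on-GPU timing run, and the function names, candidate values, and JSON key format are my guesses, not kernel-anvil's actual API.

```python
# Hypothetical sketch of the tune loop: for each unique GEMV shape in
# the GGUF, try every candidate (nwarps, rows_per_block) pair, keep the
# fastest, and write the winners to JSON for a patched llama.cpp to load.
import itertools
import json

CANDIDATES = list(itertools.product((1, 2, 4, 8), (1, 2, 4, 8)))

def benchmark(n_rows: int, nwarps: int, rows_per_block: int) -> float:
    """FAKE cost model standing in for timing the MMVQ kernel on the GPU."""
    blocks = -(-n_rows // rows_per_block)        # ceiling division
    return abs(blocks / 96 - 4) + 0.01 * nwarps  # toy: aim for ~4 blocks/CU

def tune(shapes: list[tuple[int, int]], out_path: str) -> dict:
    """Sweep all candidate settings per shape; write the winners to JSON."""
    config = {}
    for n_rows, n_cols in shapes:  # unique GEMV shapes pulled from the GGUF
        nwarps, rpb = min(CANDIDATES, key=lambda c: benchmark(n_rows, *c))
        config[f"{n_rows}x{n_cols}"] = {"nwarps": nwarps, "rows_per_block": rpb}
    with open(out_path, "w") as f:
        json.dump(config, f, indent=2)
    return config
```

In the real tool, the resulting file is what llama-server picks up via the SMITHY_CONFIG environment variable.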

Results on 7900 XTX:

  • Qwen3.5-27B Q4_K_M: 12 tok/s -> 27 tok/s (2.25x)
  • Qwen3-8B Q4_K_M individual kernels: 1.2x-2.1x per shape

Usage:

pip install -e .
kernel-anvil gguf-optimize ~/Models/my-model.gguf   # <1 second
SMITHY_CONFIG=~/.cache/smithy/my-model.json llama-server -m my-model.gguf -ngl 999

The whole profiling + sweep takes under a second. 193 tests. Works with any GGUF model on RDNA3 (7900 XTX/XT, 7800 XT). CUDA/Metal support planned.

GitHub: https://github.com/apollosenvy/kernel-anvil

The llama.cpp patch (~50 lines to mmvq.cu) is on branch smithy-shape-configs. Considering upstreaming it as a PR once it gets more testing.
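The post doesn't reproduce the config format, but a plausible shape-keyed layout for the patch to parse at startup might look like the following (shapes and values made up for illustration):

```json
{
  "1024x5120":  { "nwarps": 4, "rows_per_block": 2 },
  "17408x5120": { "nwarps": 2, "rows_per_block": 8 }
}
```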

Background: This started from the recent wave of kernel optimization papers -- KernelSkill, CUDA Agent, KernelFoundry, TritonForge -- all targeting NVIDIA exclusively. Also drew inspiration from The Residual Stream Is All You Need (Qasim et al., March 2026) which got us thinking about what's actually bottlenecking inference on AMD. Turns out the answer was simpler than expected: llama.cpp's generic kernel configs just aren't tuned for the specific shapes each model uses.

Every existing kernel optimization tool targets NVIDIA. This is the first one for AMD.

submitted by /u/Apollosenvy