llama.cpp docker images to run MTP models

Reddit r/LocalLLaMA / 5/13/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The article announces new Docker images for llama.cpp that make it easier to run MTP models, since keeping local build guides up to date is difficult.
  • It provides multiple GPU/acceleration “flavors” (CUDA 13/12, Vulkan, Intel, and ROCm), with the author noting they primarily tested CUDA 13 but encourages others to try their hardware.
  • It also points out that Unsloth released new MTP variants for Qwen 3.6, making the author’s previously grafted MTP models obsolete, and links to Hugging Face files.
  • The author discusses quantization trade-offs for the MTP layers (e.g., keeping them at Q8 for more accurate draft predictions, and thus potentially more speed, versus quantizing them more aggressively to save VRAM).
  • A sample `docker run` command is included, highlighting that `--spec-type mtp` and `--spec-draft-n-max 3` are the most critical settings for enabling MTP behavior.

There have been many improvements to the MTP pull request and the llama.cpp main branch, such as image support and various bug fixes. I recently made a new build for my local machine, but keeping guides up to date is an issue, so I built Docker images to make running them easier. If you are already using llama.cpp Docker images, it would be straightforward to switch over until official builds support MTP.

Here, pick your flavour:

  • havenoammo/llama:cuda13-server
  • havenoammo/llama:cuda12-server
  • havenoammo/llama:vulkan-server
  • havenoammo/llama:intel-server
  • havenoammo/llama:rocm-server

I have not been able to test all of them, since I only run cuda13 for now. Feel free to give them a try and see whether they work on your hardware.
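If you are already running a llama.cpp server image, switching should mostly be a matter of changing the image name. A minimal sketch below, assuming a setup based on the official server image (the official image name/tag and the model filename are placeholders for whatever you currently run):

    # Pull whichever flavour matches your hardware (cuda12 shown as an example)
    docker pull havenoammo/llama:cuda12-server

    # Before (official llama.cpp server image; your name/tag may differ):
    #   docker run --gpus all -p 8080:8080 -v ./models:/models \
    #     ghcr.io/ggml-org/llama.cpp:server-cuda \
    #     -m /models/your-model.gguf --host 0.0.0.0 --port 8080

    # After: same flags, different image
    docker run --gpus all -p 8080:8080 -v ./models:/models \
      havenoammo/llama:cuda12-server \
      -m /models/your-model.gguf --host 0.0.0.0 --port 8080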

Also, Unsloth released MTP models for Qwen 3.6, which makes my previous grafted models obsolete. You can find them here if you missed them:

https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF
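If you want to pull the quant I use in the run command further down, something like this should work with `huggingface-cli` (the exact filename inside the repo is an assumption on my part, so check the repo's file list first):

    # Download a single GGUF from the Unsloth repo into ./models
    # (filename assumed; confirm it against the repo's files on Hugging Face)
    pip install -U "huggingface_hub[cli]"
    huggingface-cli download unsloth/Qwen3.6-27B-MTP-GGUF \
      Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf \
      --local-dir ./models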

I believe they quantize some of the MTP layers. I kept mine at Q8 for better prediction. It is possible that keeping the MTP layers at higher precision makes their drafts more accurate, giving you more speed at the cost of more VRAM usage. I will keep my versions up for now, until I finish some benchmarks and I am sure they are fully obsolete.

Finally, here is how I use it:

    docker run --gpus all --rm \
      -p 8080:8080 \
      -v ./models:/models \
      havenoammo/llama:cuda13-server \
      -m /models/Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf \
      --port 8080 \
      --host 0.0.0.0 \
      -n -1 \
      --parallel 1 \
      --ctx-size 262144 \
      --fit-target 844 \
      --mmap \
      -ngl -1 \
      --flash-attn on \
      --metrics \
      --temp 1.0 \
      --top-p 0.95 \
      --top-k 20 \
      --jinja \
      --chat-template-kwargs '{"preserve_thinking":true}' \
      --ubatch-size 512 \
      --batch-size 2048 \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      --spec-type mtp \
      --spec-draft-n-max 3

Adjust as you see fit. What matters most for MTP is `--spec-type mtp` and `--spec-draft-n-max 3`.
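Once the container is up and the model has loaded, you can sanity-check it with llama-server's standard HTTP endpoints. This is a generic check, nothing MTP-specific, and the prompt is made up:

    # Check that the server is up and the model is loaded
    curl http://localhost:8080/health

    # Send a quick request through the OpenAI-compatible API
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "messages": [{"role": "user", "content": "Write a haiku about speculative decoding."}],
            "max_tokens": 128
          }'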

submitted by /u/havenoammo