I saw a post about incoming MTP support in llama.cpp, so I tried it out on an AI Max 395 with 128GB DDR5-8000. Result: between 60 and 80 tokens/s, up from about 40 tokens/s without MTP (in the screenshot I was trying ROCm, but it's more like 40-45 tokens/s with Vulkan), depending on the subject (some common math prompts seem to be the fastest). PP seems unchanged. The two GGUFs in the screen capture are almost the same size: around 36GB each. I have yet to try it on Qwen 3.5 122B, and there will be some tweaks to do with launch parameters, but it's really impressive!
MTP on strix halo with llama.cpp (PR #22673)
Reddit r/LocalLLaMA / 5/6/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- A Reddit user tested upcoming MTP (multi-token prediction) support in llama.cpp on an AMD Strix Halo (AI Max 395) with 128GB DDR5-8000, building a radv container with the amd-strix-halo-toolboxes and llama.cpp PR #22673.
- Using a Qwen3.6-35B MTP GGUF and running with the parameters `--spec-type mtp --spec-draft-n-max 3`, they observed roughly 60–80 tokens/s versus about 40–45 tokens/s without MTP (with the caveat that the Vulkan/ROCm backend choice affected the baseline); a hedged launch sketch follows this list.
- The improvement varied by prompt subject, with common math prompts appearing fastest, while prompt processing (PP) speed reportedly remained unchanged.
- The two GGUF files in the screen capture were similar in size (around 36GB each), and the user noted they still plan to test the larger Qwen 3.5 122B model with additional launch-parameter tuning.
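
For readers who want to try the setup, a minimal launch sketch is below. The `--spec-type` and `--spec-draft-n-max` flags are taken from the post and PR #22673 as reported; the binary name, model filename, and the remaining flags are standard llama.cpp options used illustratively and may differ from the user's actual invocation.

```sh
# Hedged sketch: llama.cpp with MTP speculative decoding enabled.
# --spec-type / --spec-draft-n-max are the flags reported in the post (PR #22673);
# the model filename, -ngl value, and binary choice are illustrative placeholders.
./llama-server \
  -m Qwen3.6-35B-MTP.gguf \
  -ngl 99 \
  --spec-type mtp \
  --spec-draft-n-max 3
```

In speculative decoding generally, the draft cap (here 3) bounds how many tokens are proposed per decoding step, trading extra draft work against the chance those tokens are accepted, which is consistent with the subject-dependent speedups the user observed.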