I saw a post about incoming MTP support in llama.cpp, so I tried it out on an AI Max 395 with 128GB DDR5-8000. Result: between 60 and 80 tokens/s, up from about 40 tokens/s without MTP (in the screenshot I was trying ROCm, but it's more like 40-45 tokens/s with Vulkan), depending on the subject (some common math prompts seem to be the fastest). PP seems unchanged. The two GGUFs in the screen capture are almost the same size: around 36GB each. I have yet to try it on Qwen 3.5 122B, and there will be some tweaks to do with launch parameters, but it's really impressive!
MTP on strix halo with llama.cpp (PR #22673)
Reddit r/LocalLLaMA / 5/6/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- A Reddit user tested upcoming MTP (multi-token prediction) support in llama.cpp on an AMD Strix Halo (AI Max 395) with 128GB DDR5-8000, building a radv container with the amd-strix-halo-toolboxes and llama.cpp PR #22673.
- Using a Qwen3.6-35B MTP GGUF and running with the parameters `--spec-type mtp --spec-draft-n-max 3`, they observed roughly 60–80 tokens/s versus about 40–45 tokens/s without MTP (with the caveat that the Vulkan/ROCm backend choice affected the baseline); a hedged launch sketch follows this list.
- The improvement varied by prompt subject, with common math prompts appearing fastest, while prompt processing (PP) speed reportedly remained unchanged.
- The two GGUF files in the screen capture were similar in size (around 36GB each), and the user noted they still plan to test the larger Qwen 3.5 122B model with additional launch-parameter tuning.
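
For readers who want to try the setup, a minimal launch sketch is below. The `--spec-type` and `--spec-draft-n-max` flags are taken from the post and PR #22673 as reported; the binary name, model filename, and the remaining flags are standard llama.cpp options used illustratively and may differ from the user's actual invocation.

```sh
# Hedged sketch: llama.cpp with MTP speculative decoding enabled.
# --spec-type / --spec-draft-n-max are the flags reported in the post (PR #22673);
# the model filename, -ngl value, and binary choice are illustrative placeholders.
./llama-server \
  -m Qwen3.6-35B-MTP.gguf \
  -ngl 99 \
  --spec-type mtp \
  --spec-draft-n-max 3
```

In speculative decoding generally, the draft cap (here 3) bounds how many tokens are proposed per decoding step, trading extra draft work against the chance those tokens are accepted, which is consistent with the subject-dependent speedups the user observed.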