why can’t llama.cpp combine speculative decode methods?

Reddit r/LocalLLaMA / 5/7/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The author experiments with “mtp speculative decode” using Qwen3.6 27B and finds it promising, but notes that “ngram” can deliver bigger gains for agentic coding because the model often repeats previously seen code verbatim.
  • They observe that when both methods are passed in llama.cpp command-line arguments, only ngram appears to be active, preventing simultaneous use.
  • The post asks whether there is a fundamental technical limitation or whether the restriction is purely an implementation/configuration issue that could be fixed.
  • An edit clarifies that a similar question was raised earlier in a related llama.cpp GitHub pull request discussion, suggesting the issue may already be under consideration there.
  • The practical takeaway is that users seeking the best speculative decoding performance may currently have to choose between the two approaches rather than combine them.

dicking around with the new mtp speculative decode with qwen3.6 27b, and it’s great. but for agentic coding i’ve seen significant improvements from ngram, because a decent fraction of the time (e.g. calling edit tool) the model is just repeating verbatim a section of code that it has already seen before. ngram can speculate on a lot of tokens reeaallly fast in comparison.
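
(for anyone curious, here’s roughly what the ngram trick boils down to: find the last place the current token suffix appeared earlier in the context, and propose whatever followed it as the draft. this is just an illustrative sketch, not llama.cpp’s actual code — `llama_token`, `ngram_draft`, and the parameters are made up for the example.)

```cpp
#include <cstddef>
#include <vector>

using llama_token = int; // stand-in for llama.cpp's actual token type

// Propose up to max_draft tokens by finding the most recent earlier
// occurrence of the context's final ngram_size tokens and copying the
// tokens that followed it. Returns an empty vector when nothing matches.
std::vector<llama_token> ngram_draft(const std::vector<llama_token> & ctx,
                                     size_t ngram_size, size_t max_draft) {
    if (ctx.size() <= ngram_size) {
        return {};
    }
    const size_t suffix = ctx.size() - ngram_size; // start of the trailing n-gram

    // scan backwards so the most recent repetition wins
    for (size_t pos = suffix; pos-- > 0; ) {
        bool match = true;
        for (size_t k = 0; k < ngram_size; ++k) {
            if (ctx[pos + k] != ctx[suffix + k]) {
                match = false;
                break;
            }
        }
        if (!match) {
            continue;
        }
        // copy the continuation that followed the earlier occurrence
        std::vector<llama_token> draft;
        for (size_t k = pos + ngram_size; k < ctx.size() && draft.size() < max_draft; ++k) {
            draft.push_back(ctx[k]);
        }
        return draft;
    }
    return {};
}
```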

it’d be great if we could combine them and run both at the same time, but it looks like if i add both to the command-line arguments, only ngram is active.

is there any reason both can’t be used simultaneously? fundamental limitation, or just an implementation limit with a fix on the horizon?
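
conceptually, some kind of arbitration between the two drafters seems possible: try the cheap ngram lookup first, and only fall back to mtp drafting when the context isn’t repeating. purely a hypothetical sketch reusing `ngram_draft` from above — `mtp_draft` is a made-up stub, not a real llama.cpp function:

```cpp
// Hypothetical stand-in for the model-based (MTP) drafter. In reality this
// would run the MTP head to produce draft tokens; here it's a stub so the
// sketch compiles.
std::vector<llama_token> mtp_draft(const std::vector<llama_token> & /*ctx*/,
                                   size_t /*max_draft*/) {
    return {};
}

// One possible arbitration: cheap n-gram lookup first, MTP as fallback.
std::vector<llama_token> combined_draft(const std::vector<llama_token> & ctx,
                                        size_t ngram_size, size_t max_draft) {
    std::vector<llama_token> draft = ngram_draft(ctx, ngram_size, max_draft);
    if (!draft.empty()) {
        return draft; // verbatim repetition found: speculate many tokens cheaply
    }
    return mtp_draft(ctx, max_draft); // otherwise pay for a model-based draft
}
```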

EDIT: just looked at the PR again and PmNz8 asked the same question like two hours before i posted this. go give it an updoot! https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4394544777

submitted by /u/Qwoctopussy