been playing around with the new MTP speculative decoding with Qwen3.6 27B, and it’s great. but for agentic coding i’ve seen bigger speedups from ngram speculation, because a decent fraction of the time (e.g. when calling the edit tool) the model is just repeating verbatim a section of code that’s already in the context. ngram can draft a lot of tokens really fast in comparison, since it only needs a lookup instead of a forward pass.
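for anyone unfamiliar with why ngram drafting is so cheap, here’s a rough sketch of the general idea in Python. to be clear, this is not llama.cpp’s actual implementation; `ngram_draft`, `ngram_size` and `max_draft` are names i made up for illustration. it just looks for the most recent earlier occurrence of the last few tokens and proposes whatever followed them as the draft.

```python
def ngram_draft(tokens, ngram_size=3, max_draft=8):
    """Propose draft tokens by matching the trailing n-gram earlier in the context.

    Hypothetical illustration only; names and defaults are assumptions,
    not taken from any real library.
    """
    if len(tokens) < ngram_size:
        return []

    tail = tokens[-ngram_size:]

    # Walk backward so the most recent earlier match wins.
    # Start one position before the tail itself so a continuation exists.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            # Whatever followed the match last time becomes the draft.
            return tokens[follow:follow + max_draft]

    # No earlier occurrence: nothing to speculate on.
    return []


# The context ends with [1, 2, 3], which also appeared at the very start.
context = [1, 2, 3, 4, 5, 6, 7, 1, 2, 3]
draft = ngram_draft(context, ngram_size=3, max_draft=4)
print(draft)
```

the draft still has to be verified by the main model like any other speculative tokens; the point is only that producing it costs a list scan rather than a model forward pass, which is why it shines when the model is copying code it has already seen.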
it’d be great if we could use both at the same time, but it looks like if i pass both on the command line, only ngram is active.
is there any reason both can’t be used simultaneously? is it a fundamental limitation, or just an implementation limit with a fix on the horizon?
EDIT: just looked at the PR again and PmNz8 asked the same question like two hours before i posted this. go give it an updoot! https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4394544777




