Happy to report that llama.cpp MTP support is now in beta, thanks to Aman (and all the others who have pushed the various issues in the meantime). This has the potential to actually get merged soon-ish. Currently contains support for Qwen3.5 MTP, but other models are likely to follow suit. Between this and the maturing tensor-parallel support, expect most performance gaps between llama.cpp and vLLM, at least when it comes to token generation speeds, to be erased.
Llama.cpp MTP support now in beta!
Reddit r/LocalLLaMA / 5/4/2026
📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research
Key Points
- The article reports that llama.cpp now has MTP (multi-token prediction) support available in beta.
- The beta work is attributed to Aman and other contributors who submitted and progressed related issues, and the change may be merged soon.
- Current MTP support covers the Qwen3.5 MTP model, with expectation that additional models will be added later.
- Combined with improving tensor-parallel support, the update could reduce or eliminate performance gaps between llama.cpp and vLLM for token generation speeds.
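The speed-up claimed above comes from the general shape of MTP-style decoding: an auxiliary draft head proposes several tokens ahead, and the main model verifies them in a single pass, so accepted tokens cost far less than one-at-a-time generation. A minimal toy sketch of that draft-and-verify loop (all function names and the arithmetic "models" are invented for illustration; this is not llama.cpp code):

```python
def draft_tokens(context, k):
    # Toy draft head: proposes the next k tokens from the last token.
    return [(context[-1] + i + 1) % 100 for i in range(k)]

def verify_tokens(context, drafted):
    # Toy main model: recomputes each position and accepts drafted
    # tokens up to the first mismatch.
    accepted = []
    ctx = list(context)
    for tok in drafted:
        target = (ctx[-1] + 1) % 100  # toy "main model" prediction
        if tok != target:
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

def generate(prompt, n_tokens, k=4):
    # Each loop iteration may emit up to k tokens instead of one,
    # which is where the throughput gain over plain decoding comes from.
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        accepted = verify_tokens(out, draft_tokens(out, k))
        if not accepted:
            # Fallback: emit a single token from the main model.
            accepted = [(out[-1] + 1) % 100]
        out.extend(accepted)
    return out[len(prompt):][:n_tokens]
```

In this toy the draft head always agrees with the verifier, so every batch of k tokens is accepted; real acceptance rates depend on how well the MTP head matches the main model.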
Related Articles
AnnouncementsBuilding a new enterprise AI services company with Blackstone, Hellman & Friedman, and Goldman Sachs
Anthropic News

Dara Khosrowshahi on replacing Uber drivers — and himself — with AI
The Verge

CLMA Frame Test
Dev.to

Governance and Liability in AI Agents: What I Built Trying to Answer Those Questions
Dev.to

Roundtable chat with Talkie-1930 and Gemma 4 31B
Reddit r/LocalLLaMA