Happy to report that llama.cpp MTP support is now in beta, thanks to Aman (and all the others who have pushed the various issues in the meantime). This has the potential to actually get merged soon-ish. Currently contains support for Qwen3.5 MTP, but other models are likely to follow suit. Between this and the maturing tensor-parallel support, expect most performance gaps between llama.cpp and vLLM, at least when it comes to token generation speeds, to be erased.
Llama.cpp MTP support now in beta!
Reddit r/LocalLLaMA / 5/4/2026
📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research
Key Points
- The article reports that llama.cpp now has MTP (multi-token prediction) support available in beta.
- The beta work is attributed to Aman and other contributors who submitted and progressed related issues, and the change may be merged soon.
- Current MTP support covers the Qwen3.5 MTP model, with expectation that additional models will be added later.
- Combined with improving tensor-parallel support, the update could reduce or eliminate performance gaps between llama.cpp and vLLM for token generation speeds.
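The speed-up claimed above comes from the general shape of MTP-style decoding: an auxiliary draft head proposes several tokens ahead, and the main model verifies them in a single pass, so accepted tokens cost far less than one-at-a-time generation. A minimal toy sketch of that draft-and-verify loop (all function names and the arithmetic "models" are invented for illustration; this is not llama.cpp code):

```python
def draft_tokens(context, k):
    # Toy draft head: proposes the next k tokens from the last token.
    return [(context[-1] + i + 1) % 100 for i in range(k)]

def verify_tokens(context, drafted):
    # Toy main model: recomputes each position and accepts drafted
    # tokens up to the first mismatch.
    accepted = []
    ctx = list(context)
    for tok in drafted:
        target = (ctx[-1] + 1) % 100  # toy "main model" prediction
        if tok != target:
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

def generate(prompt, n_tokens, k=4):
    # Each loop iteration may emit up to k tokens instead of one,
    # which is where the throughput gain over plain decoding comes from.
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        accepted = verify_tokens(out, draft_tokens(out, k))
        if not accepted:
            # Fallback: emit a single token from the main model.
            accepted = [(out[-1] + 1) % 100]
        out.extend(accepted)
    return out[len(prompt):][:n_tokens]
```

In this toy the draft head always agrees with the verifier, so every batch of k tokens is accepted; real acceptance rates depend on how well the MTP head matches the main model.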
Related Articles
AnnouncementsBuilding a new enterprise AI services company with Blackstone, Hellman & Friedman, and Goldman Sachs
Anthropic News

Dara Khosrowshahi on replacing Uber drivers — and himself — with AI
The Verge

CLMA Frame Test
Dev.to

Governance and Liability in AI Agents: What I Built Trying to Answer Those Questions
Dev.to

Roundtable chat with Talkie-1930 and Gemma 4 31B
Reddit r/LocalLLaMA