I've been messing around trying to get MTP (multi-token prediction) working alongside TBQ4_0 (TurboQuant's lossless 4.25 bpv KV cache) on Qwen3.6-27B for my own use.
After a day of vibecoding I think I may have gotten something viable. It went from about 43 t/s when I first got it compiling to 80-87 t/s after optimizing, with MTP draft acceptance around 73% on top of that.
Running on:
- RTX 4090 24GB
- Qwen3.6-27B-Heretic-v2 Q4_K_M with grafted MTP heads
- 262K context, TBQ4_0 KV cache, MTP draft 3
- Ubuntu 24.04, CUDA 12.x
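For a rough feel of what a 73% acceptance rate with draft 3 can buy, here's a back-of-envelope sketch in Python (this is not code from the fork). It assumes the 73% is a per-token acceptance rate and that each draft token is accepted independently of the others, which real runs won't exactly satisfy. `expected_tokens_per_step` is a made-up helper name.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumption: each draft token is accepted independently with probability
    `accept_rate`, and drafting stops at the first rejection. The main model
    always contributes one token per pass, so the expectation is
    1 + a + a^2 + ... + a^k for draft length k.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(accept_rate ** i for i in range(draft_len + 1))


if __name__ == "__main__":
    # Numbers from the post: ~73% acceptance, MTP draft 3
    print(round(expected_tokens_per_step(0.73, 3), 2))  # prints 2.65
```

Under those assumptions each main-model pass yields about 2.65 tokens instead of 1. That's an upper bound on the speedup from drafting, since running the MTP heads and verifying the drafts has its own cost.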
I'm not a professional or anything so there's probably room for improvement, but it works and the output quality seems solid. The fork's buildable if anyone wants to try it or poke holes in the approach:
https://github.com/Indras-Mirror/llama.cpp-mtp
Got DeepSeek to write up the technical details here if anyone's curious about the kernel architecture:

