Got MTP + TurboQuant running — Qwen3.6-27B at 80+ t/s with 262K context on a single RTX 4090

Reddit r/LocalLLaMA / 5/9/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author reports successfully getting MTP working together with TurboQuant (lossless 4.25 bpv KV cache) on the Qwen3.6-27B model.
  • After initial compilation achieved ~43 t/s, optimizations reportedly increased throughput to about 80–87 t/s, with MTP draft acceptance around 73%.
  • The setup runs on a single RTX 4090 (24GB) using a 262K context length, TBQ4_0 KV cache, and MTP draft 3 under Ubuntu 24.04 with CUDA 12.x.
  • The work is shared as a buildable fork of llama.cpp-mtp, alongside a separate technical write-up by the same author (drafted with Deepseek) covering the kernel architecture.
  • The author emphasizes that while they are not a professional, the approach appears to work reliably and produces solid output quality, inviting others to test and critique it.

So I've been messing around trying to get MTP working alongside TBQ4_0 (TurboQuant's lossless 4.25 bpv KV cache) on Qwen3.6-27B for my own use.

After a day of vibecoding I think I may have gotten something viable. Went from about 43 t/s when I first got it compiling to 80-87 t/s after optimizing, with MTP draft acceptance around 73% on top of that.
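For intuition on what that acceptance rate buys, here's a back-of-the-envelope speedup estimate under the standard speculative-decoding accounting. The assumptions are mine, not from the post: the 73% is a per-draft-token acceptance probability, acceptances are independent, and verifying the drafts costs roughly one target-model forward pass.

```python
# Rough speculative-decoding throughput estimate (illustrative assumptions,
# not taken from the post or the fork's actual implementation).

def expected_tokens_per_step(p_accept: float, n_draft: int) -> float:
    """Expected tokens emitted per verification step: the accepted draft
    prefix plus the one token the target model always contributes.
    1 + p + p^2 + ... + p^n == (1 - p^(n+1)) / (1 - p)."""
    return (1 - p_accept ** (n_draft + 1)) / (1 - p_accept)

if __name__ == "__main__":
    est = expected_tokens_per_step(0.73, 3)  # draft 3, ~73% acceptance
    print(f"~{est:.2f} tokens per target-model pass")  # ~2.65
```

Under those assumptions, draft 3 at 73% acceptance yields roughly 2.65 tokens per target-model pass, which is in the right ballpark for a ~43 to ~80-87 t/s jump once draft-model overhead is subtracted.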

Running on:

- RTX 4090 24GB

- Qwen3.6-27B-Heretic-v2 Q4_K_M with grafted MTP heads

- 262K context, TBQ4_0 KV cache, MTP draft 3

- Ubuntu 24.04, CUDA 12.x
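To see why a 4.25 bpv KV cache matters at 262K context on 24GB, here's a rough size estimate. The post doesn't give Qwen3.6-27B's internals, so the layer count, KV-head count, and head dimension below are illustrative assumptions; the compression ratio (16 / 4.25 ≈ 3.76x) holds regardless.

```python
# Rough KV-cache footprint at 4.25 bits per value vs. FP16.
# Model dimensions below are assumed for illustration only.

N_LAYERS   = 48       # assumed transformer layers
N_KV_HEADS = 8        # assumed GQA key/value heads
HEAD_DIM   = 128      # assumed per-head dimension
CONTEXT    = 262_144  # 262K tokens, as in the post

def kv_cache_bytes(bits_per_value: float) -> float:
    # K and V each store n_kv_heads * head_dim values per layer per token.
    values_per_token = 2 * N_KV_HEADS * HEAD_DIM * N_LAYERS
    return CONTEXT * values_per_token * bits_per_value / 8

fp16 = kv_cache_bytes(16.0)
tbq  = kv_cache_bytes(4.25)
print(f"FP16:   {fp16 / 2**30:.1f} GiB")
print(f"TBQ4_0: {tbq / 2**30:.1f} GiB  ({fp16 / tbq:.2f}x smaller)")
```

Whatever the real dimensions, a full-precision cache at this context length would dwarf the card's VRAM, so the ~3.76x KV compression is doing a lot of the work here alongside the Q4_K_M weights.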

I'm not a professional or anything so there's probably room for improvement, but it works and the output quality seems solid. The fork's buildable if anyone wants to try it or poke holes in the approach:

https://github.com/Indras-Mirror/llama.cpp-mtp

Got Deepseek to write up the technical details here if anyone's curious about the kernel architecture:

https://indrasmirror.au/blog-mtp-shared-tensors-200k.html

submitted by /u/indrasmirror