2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

Reddit r/LocalLLaMA / 5/6/2026


Key Points

  • A new llama.cpp PR adds MTP (Multi-Token Prediction) support for Qwen 3.6 27B, enabling speculative decoding using the model’s built-in tensor layers.
  • The post reports local results on an M2 Max (96 GB) showing about a 2.5x inference speedup (up to ~28 tok/s), while also using 4-bit KV cache compression to reduce memory usage.
  • The author converted and published MTP-compatible GGUF quantizations on Hugging Face, along with multiple fixes to the Jinja chat template to work reliably across tools.
  • To use the option, users must compile a custom llama.cpp version from the PR branch, then run llama-server with MTP flags (including spec-type mtp and draft token settings) and a 262K context length configuration.
  • A current limitation is noted: vision can crash llama.cpp when used alongside MTP (as reported in the PR).

In my initial post I mentioned using turboquants, but I forgot to include instructions for building llama.cpp with the corresponding PR. That PR is currently too unstable and there are heated discussions around it, so I have replaced my recommendation with standard q4_0 KV cache compression, which has some minor quality loss.

WARNING: wait before downloading from HF. I just realised my upload of the new versions with the additional chat template fix has not completed yet. I will remove this warning once it is done.

The recent PR to llama.cpp brings MTP support to Qwen 3.6 27B, using the model's built-in tensor layers for speculative decoding. None of the existing GGUFs have it, as models need to be converted with this PR.

I have tested it locally on my Mac M2 Max (96 GB), and the results are amazing: a 2.5x speed increase, bringing it to 28 tok/s!

I have converted the most useful quants and uploaded them to HF. Even if you are on Apple silicon, you should use these instead of MLX. You can download them here:

https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF
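If you want to grab a single quant from the command line, something like this should work (a sketch using the Hugging Face CLI; check the repo's file list first, as the exact filenames may differ):

```bash
# Sketch: download one quant from the repo above with the Hugging Face CLI.
# The filename is assumed from the example server command further down; verify it in the repo.
pip install -U "huggingface_hub[cli]"
huggingface-cli download froggeric/Qwen3.6-27B-MTP-GGUF \
  Qwen3.6-27B-Q5_K_M-mtp.gguf --local-dir ./models
```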

This also includes 7 fixes I made to the original Jinja chat template, which relied on vLLM-specific behaviour that broke in other tools:

https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates
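If your client does not pick up the template embedded in the GGUF, or you want to apply the fixed template to another GGUF, llama-server can load a Jinja template from a file. A minimal sketch (the template filename here is a placeholder for whichever file you take from the repo above):

```bash
# Sketch: override the embedded chat template with a fixed Jinja file.
# "qwen3.6-fixed.jinja" is a placeholder name; use the file you actually downloaded.
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --jinja --chat-template-file qwen3.6-fixed.jinja \
  --port 8081
```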

For now, you will need to compile your own build of llama.cpp to use these MTP quants. It is fairly simple to do:

```bash
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/22673/head:mtp-pr && git checkout mtp-pr

cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli llama-server
```
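The cmake line above targets Apple silicon (Metal). If you are building for an NVIDIA GPU instead, the equivalent would be the CUDA backend; a sketch, assuming the CUDA toolkit is installed:

```bash
# Sketch: same build, but with the CUDA backend instead of Metal.
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli llama-server
```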

Then to start serving with the API endpoint, use a command similar to:

```bash
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --spec-type mtp --spec-draft-n-max 5 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081
```
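Once it is running, llama-server exposes the usual OpenAI-compatible endpoint, so you can smoke-test it with curl before pointing your tools at it (the prompt and sampling values here are just an example):

```bash
# Sketch: quick test of the OpenAI-compatible chat endpoint served by llama-server.
curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a one-line hello world in Python."}
    ],
    "temperature": 0.7
  }'
```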

Vision currently crashes llama.cpp when used alongside MTP (reported 2026-05-06 in the PR).

That's it. Three optimizations in one command:

| Flag | What it does | Impact |
|---|---|---|
| --spec-type mtp --spec-draft-n-max 5 | Multi-Token Prediction (built into the model) | 2.5x faster generation |
| --cache-type-k q4_0 --cache-type-v q4_0 | 4-bit KV cache (instead of 16-bit) | Quarter the KV memory |
| -c 262144 | 262K context window | Full native context on 48 GB Mac with q4_0 KV |

Adjust -m, -c, and --cache-type-k/v for your hardware, according to the tables below.

Here are my recommendations based on your hardware:

Apple Silicon

| RAM | Quant | KV cache | Max context | Memory used | Vision |
|---|---|---|---|---|---|
| 16 GB | IQ2_M | q4_0 | 48K | 12.1 GB | |
| 24 GB | IQ3_M | q4_0 | 64K | 15.3 GB | |
| 24 GB | IQ4_XS | q4_0 | 32K | 16.1 GB | |
| 32 GB | Q4_K_M | q8_0 | 64K | 23.1 GB | |
| 32 GB | IQ4_XS | q4_0 | 128K | 21.8 GB | |
| 32 GB | Q5_K_M | q4_0 | 80K | 23.2 GB | |
| 48 GB | Q6_K | q8_0 | 128K | 34.7 GB | |
| 48 GB | Q5_K_M | q4_0 | 262K | 32.3 GB | |
| 48 GB | Q8_0 | q8_0 | 80K | 36.0 GB | |
| 64+ GB | Q8_0 | q8_0 | 262K | 54.2 GB | |

NVIDIA GPU

| VRAM | Quant | KV cache | Max context | Memory used | Vision |
|---|---|---|---|---|---|
| 16 GB | IQ3_M | q4_0 | 48K | 14.5 GB | |
| 16 GB | IQ2_M | q4_0 | 64K | 13.8 GB | |
| 24 GB | Q4_K_M | q4_0 | 128K | 22.2 GB | |
| 24 GB | IQ4_XS | q4_0 | 128K | 21.8 GB | |
| 48 GB | Q8_0 | q8_0 | 128K | 40.8 GB | |
| 48 GB | Q6_K | q4_0 | 262K | 35.0 GB | |
| 80 GB | Q8_0 | q8_0 | 262K | 54.2 GB | |

24 GB Mac: IQ4_XS for quality (32K), or IQ3_M for more context (64K).

32 GB Mac: IQ4_XS with q4_0 reaches 128K (imatrix). Q5_K_M for quality at 80K.

48 GB Mac: Q5_K_M/q4_0 reaches 262K. Q6_K at 128K or Q8_0 at 80K for higher quality.

24 GB GPU: IQ4_XS enables vision at 128K (Q4_K_M can't fit both).

48 GB GPU: Q6_K/q4_0 reaches 262K.

For coding and reasoning, prioritize higher quants with q8_0 KV. For general chat and RAG, IQ4_XS with q4_0 and larger context is often sufficient.
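To make that trade-off concrete, here are two hedged example invocations based on the 32 GB Mac rows in the table above (the GGUF filenames are assumptions following the naming used earlier):

```bash
# Sketch: quality-leaning setup for coding/reasoning on a 32 GB Mac
# (Q4_K_M with q8_0 KV cache, 64K context per the table above).
llama-server -m Qwen3.6-27B-Q4_K_M-mtp.gguf \
  --spec-type mtp --spec-draft-n-max 5 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 65536 -ngl 99 --port 8081

# Sketch: context-leaning setup for general chat/RAG on the same machine
# (IQ4_XS with q4_0 KV cache, 128K context per the table above).
llama-server -m Qwen3.6-27B-IQ4_XS-mtp.gguf \
  --spec-type mtp --spec-draft-n-max 5 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -c 131072 -ngl 99 --port 8081
```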

Vision adds 0.9 GB for the mmproj. I recommend reserving 8 GB for macOS (you can try pushing it down to 4 GB on a 16 GB Mac). You can increase available VRAM by raising the wired memory limit, e.g. for a 96 GB Mac: sudo sysctl iogpu.wired_limit_mb=90112 (88 GB). Adjust the value for your RAM size. NVIDIA GPUs reserve ~1 GB for CUDA.
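For reference, the value passed to iogpu.wired_limit_mb is just (total RAM minus the macOS reserve) in MiB; a sketch of the arithmetic for a couple of common sizes, assuming the 8 GB reserve suggested above:

```bash
# Sketch: wired limit in MiB = (total RAM in GB - reserve for macOS in GB) * 1024
# 96 GB Mac, 8 GB reserve: (96 - 8) * 1024 = 90112
sudo sysctl iogpu.wired_limit_mb=90112
# 48 GB Mac, 8 GB reserve: (48 - 8) * 1024 = 40960
# sudo sysctl iogpu.wired_limit_mb=40960
# Note: the setting does not persist across reboots.
```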

submitted by /u/ex-arman68