Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post

Reddit r/LocalLLaMA / 4/23/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The author describes an experiment comparing generation speed while using Qwen-3.6-27B with llama.cpp, showing large improvements across successive program versions.
  • Token generation speed increased from 13.60 t/s to 25.53 t/s, then to 68.35 t/s, and finally to 136.75 t/s within the same session using speculative decoding.
  • The post attributes the speed gains to a specific llama-server speculative decoding configuration (ngram speculative decoding with tuned parameters).
  • The author also notes a workflow benefit: Qwen successfully detected and helped fix a bug when the user provided a screenshot with a browser console.
  • They conclude that, while settings may not be optimal, updating llama.cpp and trying speculative decoding can yield substantial practical speedups on local hardware.

First a little explanation about what is happening in the pictures.

I did a small experiment to determine how much of a speed improvement speculative decoding brings to the new Qwen (TL;DR: a big one!).

  1. The first image shows my simple prompt at the beginning of the session.
  2. The second image shows the time and token generation speed (13.60 t/s) for the first version of the program, along with my prompt asking for a new feature.
  3. The third image shows the time and token generation speed for the second version of the program (25.53 t/s, a noticeable improvement). You can also see there was a bug. I showed Qwen a screenshot with the browser console open, and it correctly identified the kind of bug and fixed it.
  4. The fourth image shows the time and token generation speed for the fixed version of the program (68.35 t/s, a big improvement), along with my prompt for a small change to the program.
  5. The fifth image shows the time and token generation speed for the final version of the program after that small change (136.75 t/s!!!)

The last image shows the finished, beautiful aquarium. The aesthetics and functionality are on another level compared with older models of similar size, and with many much bigger ones.

So the speed goes 13.60 > 25.53 > 68.35 > 136.75 t/s over the session, and every time Qwen delivered the full code. I use this kind of workflow very often. And all of this thanks to one simple addition to the llama-server command:

'--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48'.

I am not sure these are the best settings, but they work well for me. I will experiment with them more.
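For anyone curious what n-gram speculative decoding actually does: instead of running a separate draft model, the server can reuse text already in the context as a cheap source of draft tokens, which the target model then verifies in a single batched forward pass. Below is a minimal conceptual sketch of the drafting step only (prompt-lookup style); the function name and logic are my own illustration, not llama.cpp's actual implementation, and real implementations add verification, caching, and tuned acceptance heuristics.

```python
# Conceptual sketch of n-gram (prompt-lookup) speculative decoding.
# Idea: if the trailing n-gram of the generated text has appeared
# earlier in the context, propose the tokens that followed it as a
# "draft" for the target model to verify cheaply in one batch.
# This is an illustration of the idea, NOT llama.cpp's actual code.

def propose_draft(tokens, ngram_size=3, draft_max=8):
    """Find the most recent earlier occurrence of the trailing
    n-gram and return the tokens that followed it as a draft."""
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Search backwards so the most recent prior match wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = tokens[start + ngram_size:start + ngram_size + draft_max]
            return follow
    return []

# Repetitive text (like code) drafts very well: once a pattern has
# appeared, its continuation can be proposed almost for free.
ctx = "for i in range ( 10 ) : print ( i ) for i in range (".split()
print(propose_draft(ctx, ngram_size=3, draft_max=5))
# → ['10', ')', ':', 'print', '(']
```

This also explains why the speed climbed during the session: later responses repeat a lot of earlier code, so more draft tokens get accepted.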

My llama-swap command:

 ${llama-server} -m ${models}/Qwen3.6-27B/Qwen3.6-27B-Q8_0.gguf --mmproj ${models}/Qwen3.6-27B/mmproj-BF16Qwen3.6-27B.gguf --no-mmproj-offload --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48 --ctx-size 128000 --temp 1.0 --top-p 0.95 --top-k 20 --presence_penalty 1.5 --chat-template-kwargs '{"preserve_thinking": true}' 
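As a side note on where per-response speed figures like these come from: llama.cpp's server reports generation timings in its responses (the native `/completion` endpoint includes a `timings` object with fields such as `predicted_n` and `predicted_ms`). A small sketch of deriving t/s from such an object — the sample values below are made up for illustration:

```python
# Sketch: compute tokens/sec from the "timings" object that
# llama.cpp's native /completion endpoint includes in its JSON reply.
# Field names follow llama.cpp's server; the sample values here are
# illustrative only, not measured output.

def tokens_per_second(timings: dict) -> float:
    """predicted_n tokens generated in predicted_ms milliseconds."""
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)

sample = {"predicted_n": 512, "predicted_ms": 3744.0}  # illustrative
print(round(tokens_per_second(sample), 2))
# → 136.75
```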

My linux PC has 40GB VRAM (rtx3090 and rtx4060ti) and 128GB DDR5 RAM.

Big thanks to all smart people who contribute to llamacpp, to this Reddit community and to the Qwen crew.

Free lunch, try it out...

Edit: I forgot to mention that there were some relevant changes in llama.cpp two days ago, so make sure to update.

submitted by /u/Then-Topic8766