I have totally missed the boat on speculative decoding.
Today, while generating some code for the frontend again, I found myself staring at some quite monotonous JavaScript code. I decided to give llama.cpp's speculative decoding settings a go and was pleasantly surprised to see a 15-30% speedup in generation for this exact use case. The code was an arcade game on canvas (lots of simple for loops and if statements for boundary checks and simple game logic; a lot of repetitive input).
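To see why repetitive code drafts so well, here is a toy sketch of the general n-gram lookup idea behind this kind of speculative decoding (this is purely illustrative, not llama.cpp's actual implementation; the function name and parameters are made up):

```python
# Toy n-gram drafting sketch (illustrative only, not llama.cpp's code):
# find the most recent earlier occurrence of the current n-token suffix
# in the context, and propose the tokens that followed it as the draft.
def draft_from_ngram(tokens, n=3, draft_max=8):
    """Propose up to draft_max draft tokens by matching the last n
    tokens against an earlier occurrence in the sequence."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Search backwards for an earlier occurrence of the suffix.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == suffix:
            start = i + n
            return tokens[start:start + draft_max]
    return []

# Repetitive input (like boilerplate boundary checks) drafts well:
ctx = "if x < 0 : x = 0 ; if y < 0".split()
print(draft_from_ngram(ctx, n=2))
# proposes [':', 'x', '=', '0', ';', 'if', 'y', '<'] — the target model
# then accepts the matching prefix and rejects at the first mismatch
```

The drafted tokens are verified in a single batched forward pass of the target model; every accepted token is one decode step saved, which is where the speedup on repetitive code comes from.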
The settings that I ended up using on llama-server were these:
--spec-type ngram-mod --spec-ngram-size-n 18 --draft-min 6 --draft-max 48
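For context, a full llama-server invocation with these flags might look like the following; the model path, context size, and port are placeholders I've added, not from my actual setup:

```shell
# Hypothetical full invocation; only the speculative-decoding flags
# are the ones quoted above, the rest (-m, -c, --port) are assumptions.
llama-server \
  -m ./models/model.gguf \
  -c 32768 \
  --port 8080 \
  --spec-type ngram-mod \
  --spec-ngram-size-n 18 \
  --draft-min 6 \
  --draft-max 48
```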
The model I used was Gemma4 26B A4B (unsloth quant). On an "add a feature of 60s comic style text effects like bang or pow text highlights with fading them out to alpha channel" prompt, on a piece of a brick breaker game (just for the fun of it I tortured the LLM into implementing it with SVG graphics instead of canvas), I got the following output, which I reckon is actually a decent match rate:
draft acceptance rate = 0.76429 ( 2727 accepted / 3568 generated)
statistics ngram_mod: #calls(b,g,a) = 2 7342 80, #gen drafts = 84, #acc drafts = 80, #gen tokens = 3880, #acc tokens = 2768, dur(b,g,a) = 1.765, 23.972, 2.707 ms
slot release: id 3 | task 4678 | stop processing: n_tokens = 23670, truncated = 0
Now a question for fellow coders here: what kind of settings do you use on your Gemma4 or Qwen3.5 setups, if you make use of them at all? I am running low on VRAM here, hence I don't use a draft model.