Share your speculative settings for llama.cpp and Gemma4

Reddit r/LocalLLaMA / 4/14/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • A user reports achieving a 15–30% generation speedup in llama.cpp by trying speculative decoding for repetitive JavaScript code use cases (e.g., arcade game logic).
  • They share specific llama.cpp settings for ngram-based speculative decoding: `--spec-type ngram-mod --spec-ngram-size-n 18 --draft-min 6 --draft-max 48`.
  • Using Gemma4 26B (unsloth quant) as the target model, they report a draft token acceptance rate of ~0.764 (2727 accepted / 3568 generated tokens), with 80 of 84 generated drafts accepted.
  • The post includes additional generation and timing statistics (accepted vs generated tokens and per-step durations) indicating the speculative path was productively used rather than discarded.
  • The author asks other local-LLM coders what speculative decoding settings they use for Gemma4 or Qwen 3.5, especially in constrained VRAM setups where they may avoid separate draft models.

I have totally missed the boat on speculative decoding.

Today, while generating some frontend code again, I found myself staring at some quite monotonous JavaScript. I decided to give llama.cpp's speculative decoding settings a go and was pleasantly surprised to see a 15-30% generation speedup for this exact use case. The code was an arcade game on canvas (lots of simple for loops and if statements for boundary checks and simple game logic, i.e. a lot of repetitive input).
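Ngram-based speculative decoding works well on this kind of repetitive code because the draft tokens come from the context itself: when the trailing n tokens have appeared before, the tokens that followed that earlier occurrence are proposed as a draft and then verified by the target model in one batched pass. This is a toy Python sketch of the idea only, not llama.cpp's actual implementation; the function name and parameters mirror the flags below but are otherwise made up here.

```python
def ngram_draft(tokens, n=18, draft_min=6, draft_max=48):
    """Toy n-gram drafting: find the most recent earlier occurrence of the
    trailing n-gram and propose the tokens that followed it as a draft.
    Illustrative only; llama.cpp's ngram-mod logic differs in detail."""
    if len(tokens) <= n:
        return []
    key = tokens[-n:]
    # Scan backwards for a previous match of the trailing n-gram,
    # excluding the trailing occurrence itself.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == key:
            draft = tokens[i + n:i + n + draft_max]
            # Only propose drafts that are long enough to be worth verifying.
            return draft if len(draft) >= draft_min else []
    return []
```

The target model then scores the draft in a single forward pass and accepts the longest prefix that matches its own predictions, which is where the speedup comes from on repetitive input.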

The settings I ended up using on llama-server were these:

--spec-type ngram-mod --spec-ngram-size-n 18 --draft-min 6 --draft-max 48
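For context, a full llama-server invocation with these flags might look like the following. The model filename, context size, and port are placeholders, and the `--spec-*` flag names are taken verbatim from the post rather than checked against a specific llama.cpp release:

```shell
# Hypothetical invocation; model path, context size, and port are placeholders.
llama-server \
  -m ./gemma4-26b-a4b-unsloth.gguf \
  -c 16384 \
  --port 8080 \
  --spec-type ngram-mod \
  --spec-ngram-size-n 18 \
  --draft-min 6 \
  --draft-max 48
```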

The model I used was Gemma4 26B A4B (unsloth quant). On an "add a feature of 60s comic style text effects like bang or pow text highlights with fading them out to alpha channel" prompt, against a brick breaker game (just for the fun of it I tortured the LLM into implementing it with SVG graphics instead of canvas), I got the following output, which I reckon is actually decent matching:

draft acceptance rate = 0.76429 ( 2727 accepted / 3568 generated)

statistics ngram_mod: #calls(b,g,a) = 2 7342 80, #gen drafts = 84, #acc drafts = 80, #gen tokens = 3880, #acc tokens = 2768, dur(b,g,a) = 1.765, 23.972, 2.707 ms

slot release: id 3 | task 4678 | stop processing: n_tokens = 23670, truncated = 0
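The headline acceptance rate is just accepted draft tokens divided by generated draft tokens. A quick sanity check against the numbers in the log above (note that the per-call statistics line reports slightly different token totals, presumably counted over a different scope):

```python
# Token counts from the acceptance-rate log line above.
accepted, generated = 2727, 3568
rate = accepted / generated
print(f"{rate:.5f}")  # prints 0.76429, matching the reported rate
```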

Now a question for fellow coders here: what kind of settings do you use on your Gemma4 or Qwen 3.5 setups, if you make use of speculative decoding at all? I am running low on VRAM here, hence I don't use a draft model.

submitted by /u/hurdurdur7