Speculative decoding works great for Gemma 4 31B in llama.cpp

Reddit r/LocalLLaMA / 4/4/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post reports that speculative decoding in llama.cpp can yield an ~11% speedup when using Gemma 3 270M as the draft model with Gemma 4 31B as the main model.
  • It provides concrete llama-cli command-line flags to reproduce the setup, including using `--no-mmproj` and specifying the draft model via `-hfd`.
  • Testing on a single NVIDIA 3090 showed a generation-rate gain at a measured draft acceptance rate of ~0.44 (820 accepted / 1863 generated).
  • Compared to running without speculative decoding, the prompt processing rate was essentially unchanged (slightly lower), while the generation rate rose from 32.9 to 36.6 t/s, improving overall performance for the author’s configuration.
  • The author suggests speculative decoding is particularly effective for this Gemma pairing, implying tuning draft model size/quality can materially affect local LLM latency.

I get an ~11% speedup with Gemma 3 270M as the draft model. Try it by adding:

--no-mmproj -hfd unsloth/gemma-3-270m-it-GGUF:Q8_0 

Testing with (on a 3090):

./build/bin/llama-cli -hf unsloth/gemma-4-31B-it-GGUF:Q4_1 --jinja --temp 1.0 --top-p 0.95 --top-k 64 -ngl 1000 -st -f prompt.txt --no-mmproj -hfd unsloth/gemma-3-270m-it-GGUF:Q8_0 

Gave me:

[ Prompt: 607.3 t/s | Generation: 36.6 t/s ]
draft acceptance rate = 0.44015 ( 820 accepted / 1863 generated)

vs.

[ Prompt: 613.8 t/s | Generation: 32.9 t/s ]
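The numbers above line up: the acceptance rate is just accepted / generated draft tokens, and the speedup follows from the two generation rates. A quick sketch of the arithmetic (values copied from the post; the `awk` calls are only for floating-point math in the shell):

```shell
#!/bin/sh
# Draft acceptance rate: accepted draft tokens / generated draft tokens.
accepted=820
generated=1863
rate=$(awk "BEGIN { printf \"%.5f\", $accepted / $generated }")
echo "draft acceptance rate = $rate"

# Generation speedup: with-draft rate vs. baseline rate, in percent.
speedup=$(awk "BEGIN { printf \"%.1f\", (36.6 / 32.9 - 1) * 100 }")
echo "generation speedup = ${speedup}%"
```

A 0.44 acceptance rate is modest, but because the 270M draft model is nearly free to run next to a 31B target, even accepting fewer than half of the drafted tokens still nets a double-digit generation speedup.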

submitted by /u/Leopold_Boom