Speculative decoding in llama.cpp for Gemma 4 31B IT / Qwen 3.5 27B?

Reddit r/LocalLLaMA / 4/12/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The post asks whether speculative decoding has been tested in llama.cpp specifically with Gemma 4 31B IT and/or Qwen 3.5 27B.
  • For Gemma, the proposer considers using a smaller same-family draft model to generate the draft tokens.
  • For Qwen 3.5, the proposer is uncertain whether speculative decoding functions well or yields benefits in llama.cpp.
  • The question seeks community guidance on which draft model combinations work best and whether they produce measurable real-world speedups.

Has anyone here tested speculative decoding in llama.cpp with Gemma 4 31B IT or Qwen 3.5 27B?

For Gemma, I was thinking about using a smaller same-family draft model.
For Qwen 3.5, I’m not sure if it works well at all in llama.cpp.

If you tried it, which draft model worked best, and did you see a real speedup?
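For anyone who wants to test this, speculative decoding in llama.cpp is driven by the draft-model flags on `llama-server`. A minimal sketch follows; the GGUF filenames are placeholders (not real release artifacts), and exact flag names can vary between llama.cpp versions:

```shell
# Serve a large target model with a smaller same-family draft model.
# -md / --model-draft selects the draft model; --draft-max caps how many
# tokens the draft proposes per step before the target verifies them.
# Filenames below are placeholders for whatever GGUFs you actually have.
./llama-server \
  -m  gemma-large-it-Q4_K_M.gguf \
  -md gemma-small-it-Q4_K_M.gguf \
  --draft-max 16 \
  --draft-min 4 \
  -ngl 99
```

One practical constraint: the draft and target models must use a compatible tokenizer/vocabulary, which is why a smaller same-family model is the usual choice for the draft.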

submitted by /u/No_Algae1753