Speculative decoding in llama.cpp for Gemma 4 31B IT / Qwen 3.5 27B?

Reddit r/LocalLLaMA / 4/12/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The post asks whether speculative decoding has been tested in llama.cpp specifically with Gemma 4 31B IT and/or Qwen 3.5 27B.
  • For Gemma, the proposer considers using a smaller same-family draft model to generate the draft tokens.
  • For Qwen 3.5, the proposer is uncertain whether speculative decoding functions well or yields benefits in llama.cpp.
  • The question seeks community guidance on which draft model combinations work best and whether they produce measurable real-world speedups.

Has anyone here tested speculative decoding in llama.cpp with Gemma 4 31B IT or Qwen 3.5 27B?

For Gemma, I was thinking about using a smaller same-family draft model.
For Qwen 3.5, I’m not sure if it works well at all in llama.cpp.

If you tried it, which draft model worked best, and did you see a real speedup?
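For anyone who wants to test this, speculative decoding in llama.cpp is driven by the draft-model flags on `llama-server`. A minimal sketch follows; the GGUF filenames are placeholders (not real release artifacts), and exact flag names can vary between llama.cpp versions:

```shell
# Serve a large target model with a smaller same-family draft model.
# -md / --model-draft selects the draft model; --draft-max caps how many
# tokens the draft proposes per step before the target verifies them.
# Filenames below are placeholders for whatever GGUFs you actually have.
./llama-server \
  -m  gemma-large-it-Q4_K_M.gguf \
  -md gemma-small-it-Q4_K_M.gguf \
  --draft-max 16 \
  --draft-min 4 \
  -ngl 99
```

One practical constraint: the draft and target models must use a compatible tokenizer/vocabulary, which is why a smaller same-family model is the usual choice for the draft.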

submitted by /u/No_Algae1753