Has anyone here tested speculative decoding in llama.cpp with Gemma 4 31B IT or Qwen 3.5 27B?
For Gemma, I was thinking of pairing it with a smaller draft model from the same family.
For Qwen 3.5, I’m not sure speculative decoding works well in llama.cpp at all.
If you tried it, which draft model worked best and did you get a real speedup?
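For context, here's roughly what I'd run. This is a sketch based on the llama.cpp server flags as I understand them (`-md`/`--model-draft` for the draft model); the model filenames are placeholders, and exact flag names and defaults can differ between builds. One known requirement is that the draft and target models need compatible vocabularies, which is why a same-family draft is the usual choice:

```shell
# Speculative decoding via llama-server (sketch; flag names may vary by build):
#   -m          target model (placeholder filename)
#   -md         smaller draft model from the same family (placeholder filename)
#   --draft-max maximum number of tokens the draft model proposes per step
#   -ngl/-ngld  GPU layers to offload for the target / draft model
llama-server -m gemma-large.gguf -md gemma-small.gguf \
  --draft-max 16 --draft-min 1 -ngl 99 -ngld 99
```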