Speculative decoding question, 665% speed increase

Reddit r/LocalLLaMA / 4/20/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The post discusses speculative decoding configurations in llama.cpp, focusing on how different settings (e.g., ngram-map, ngram size, and draft token ranges) affect generation speed.
  • The author observes large speed differences across models—Gemma 4 sees roughly 2× token generation speed, Qwen 3.6 shows about a 40% increase, and the smaller Devstral model reports an unexpectedly large ~665% increase.
  • A key follow-up suggests that decoding parameters such as repeat penalty and speculative type (switching to ngram-mod) can substantially change the measured speed gains.
  • The overall takeaway is that apparent “speedup” varies by model behavior and the exact speculative decoding and sampling parameters, so speed claims need careful replication with consistent settings.

I'm using these settings in llama.cpp: --spec-type ngram-map-k --spec-ngram-size-n 24 --draft-min 12 --draft-max 48
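The flags above are the author's; as a rough illustration of what ngram-based speculative decoding does, here is a minimal Python sketch of ngram-lookup drafting (match the trailing n-gram earlier in the context and copy what followed it as a cheap draft). This is an assumed, simplified reimplementation of the idea, not llama.cpp code, and the parameter names only mirror the flags above.

```python
def ngram_draft(tokens, ngram_size=3, draft_min=2, draft_max=8):
    """Propose draft tokens by matching the trailing n-gram earlier in context.

    If the last `ngram_size` tokens appeared earlier, copy the tokens that
    followed that earlier occurrence as a draft for the target model to
    verify in a single batched forward pass.
    """
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Search right-to-left for the most recent earlier occurrence of the tail
    # (excluding the trailing occurrence itself).
    for i in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[i:i + ngram_size] == tail:
            draft = tokens[i + ngram_size:i + ngram_size + draft_max]
            return draft if len(draft) >= draft_min else []
    return []

# Code-edit prompts repeat most of the input verbatim, so matches are long
# and frequent -- one plausible reason "minor changes in code" drafts so well.
print(ngram_draft([1, 2, 3, 4, 5, 9, 1, 2, 3], ngram_size=3, draft_max=3))  # → [4, 5, 9]
```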

What's the real reason that, for let's say a prompt asking for "minor changes in code", the speedup differs between models:
Gemma 4 31b: token generation speed doubles, so +100%
Qwen 3.6: only 40% more speed
Devstral small: 665% increase in speed (what?)
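One common explanation for the spread: speedup depends on how often the target model accepts the drafted tokens. A back-of-envelope model (a simplification that ignores draft cost and verification overhead, so the numbers are illustrative only) is that with draft length k and per-token acceptance probability p, each verification pass yields about (1 − p^(k+1)) / (1 − p) tokens for roughly one target-model forward pass:

```python
def expected_speedup(p, k):
    """Expected tokens accepted per verification pass, assuming i.i.d.
    per-token acceptance probability p and draft length k."""
    if p == 1.0:
        return k + 1
    return (1 - p ** (k + 1)) / (1 - p)

# A model that rarely agrees with the draft (low p) barely speeds up; one
# that reproduces the input nearly verbatim on code edits (high p) can show
# very large multiples, which may be what Devstral is doing here.
for p in (0.3, 0.7, 0.95):
    print(f"p={p}: ~{expected_speedup(p, k=12):.1f}x")
```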

EDIT:

Added --repeat-penalty 1.0 and switched to --spec-type ngram-mod for Qwen 3.6; now speed is increased by 140 tk/s over the 100 tk/s base on minor edits.
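A plausible reason --repeat-penalty 1.0 helps: a repeat penalty above 1 down-weights tokens already in the context, but ngram drafts are by construction exact repeats of the context, so a penalized target model keeps disagreeing with (and rejecting) them. The sketch below uses an llama.cpp-style penalty rule (divide positive logits, multiply negative ones) with made-up logit values to show the drafted token losing the argmax under a penalty and winning it without one; this is an illustration of the interaction, not the author's measured mechanism.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty):
    """Penalize tokens that already appeared in context.

    llama.cpp-style rule: positive logits are divided by the penalty,
    negative logits are multiplied by it. penalty == 1.0 is a no-op.
    """
    out = list(logits)
    for t in set(recent_tokens):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

logits = [2.0, 1.9, 0.5]  # token 0 is the drafted (repeated) token
recent = [0]              # it already appeared in context

# With penalty 1.1: 2.0 / 1.1 ≈ 1.82 < 1.9, so token 1 wins and the draft
# token would be rejected by a greedy target model.
print(max(range(3), key=lambda i: apply_repeat_penalty(logits, recent, 1.1)[i]))  # → 1
# With penalty 1.0 the logits are untouched and the draft token 0 wins.
print(max(range(3), key=lambda i: apply_repeat_penalty(logits, recent, 1.0)[i]))  # → 0
```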

submitted by /u/GodComplecs