llama.cpp speculative checkpointing was merged

Reddit r/LocalLLaMA / 4/19/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The speculative checkpointing feature in llama.cpp has been merged via a PR, enabling potential generation speedups for some prompt types.
  • Performance gains vary: some prompts are faster, while others see little to no improvement due to low draft acceptance streaks.
  • Optimal working parameters depend on the task type and repetition patterns present in the input.
  • For coding workloads, the post reports roughly 0% to 50% speedup using specific speculative decoding settings (n-gram spec with tuned draft-min/draft-max).

https://github.com/ggml-org/llama.cpp/pull/19493

Some prompts get a speedup, others don't (cases with low draft-acceptance streaks).
Good working parameters depend on the task type and the repetition patterns in the input.
For coding, I got roughly a 0%~50% speedup with these params:

--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 
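The params above tune an n-gram draft strategy: instead of running a separate draft model, the decoder looks for an earlier occurrence of the most recent n tokens in the context and speculatively proposes the tokens that followed that match, which the main model then verifies. Below is a minimal Python sketch of that draft-lookup idea under my reading of the flags; the function name and exact matching policy are illustrative, not llama.cpp's actual implementation (`ngram_size` loosely corresponds to `--spec-ngram-size-n`, `draft_max` to `--draft-max`).

```python
def ngram_draft(tokens, ngram_size, draft_max):
    """Propose a speculative draft by n-gram lookup over the context.

    Finds the most recent earlier occurrence of the current
    ngram_size-token suffix and returns up to draft_max tokens
    that followed it. Returns [] when no match exists.
    (Illustrative sketch, not llama.cpp's real algorithm.)
    """
    if len(tokens) < ngram_size:
        return []
    suffix = tokens[-ngram_size:]
    # Scan backwards from the most recent candidate position,
    # excluding the suffix occurrence at the very end.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            cont = tokens[start + ngram_size:start + ngram_size + draft_max]
            if cont:
                return cont
    return []
```

This also explains why speedups vary by prompt: repetitive inputs (boilerplate-heavy code, templated text) produce long n-gram matches and long accepted draft streaks, while novel prose rarely matches and yields an empty or quickly rejected draft.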
submitted by /u/AdamDhahabi