Speculative decoding works great for Gemma 4 31B in llama.cpp

Reddit r/LocalLLaMA / 4/4/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post reports that speculative decoding in llama.cpp can yield an ~11% speedup when using Gemma 3 270M as the draft model with Gemma 4 31B as the main model.
  • It provides concrete llama-cli command-line flags to reproduce the setup, including using `--no-mmproj` and specifying the draft model via `-hfd`.
  • Testing on a single NVIDIA 3090 showed a generation-rate gain at a measured draft acceptance rate of ~0.44 (820 accepted / 1863 generated).
  • Compared to running without speculative decoding, the prompt processing rate was essentially unchanged (slightly lower), while the generation rate rose from 32.9 to 36.6 t/s, improving overall performance for the author’s configuration.
  • The author suggests speculative decoding is particularly effective for this Gemma pairing, implying tuning draft model size/quality can materially affect local LLM latency.

I get an ~11% speedup with Gemma 3 270M as the draft model. Try it by adding:

--no-mmproj -hfd unsloth/gemma-3-270m-it-GGUF:Q8_0 

Testing with (on a 3090):

./build/bin/llama-cli -hf unsloth/gemma-4-31B-it-GGUF:Q4_1 --jinja --temp 1.0 --top-p 0.95 --top-k 64 -ngl 1000 -st -f prompt.txt --no-mmproj -hfd unsloth/gemma-3-270m-it-GGUF:Q8_0 

Gave me:

[ Prompt: 607.3 t/s | Generation: 36.6 t/s ]
draft acceptance rate = 0.44015 ( 820 accepted / 1863 generated)

vs.

[ Prompt: 613.8 t/s | Generation: 32.9 t/s ]
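The numbers above line up: the acceptance rate is just accepted / generated draft tokens, and the speedup follows from the two generation rates. A quick sketch of the arithmetic (values copied from the post; the `awk` calls are only for floating-point math in the shell):

```shell
#!/bin/sh
# Draft acceptance rate: accepted draft tokens / generated draft tokens.
accepted=820
generated=1863
rate=$(awk "BEGIN { printf \"%.5f\", $accepted / $generated }")
echo "draft acceptance rate = $rate"

# Generation speedup: with-draft rate vs. baseline rate, in percent.
speedup=$(awk "BEGIN { printf \"%.1f\", (36.6 / 32.9 - 1) * 100 }")
echo "generation speedup = ${speedup}%"
```

A 0.44 acceptance rate is modest, but because the 270M draft model is nearly free to run next to a 31B target, even accepting fewer than half of the drafted tokens still nets a double-digit generation speedup.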

submitted by /u/Leopold_Boom