Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Reddit r/LocalLLaMA / 4/12/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post reports controlled benchmarks showing that speculative decoding using Gemma 4 E2B as the draft model can significantly speed up inference for the Gemma 4 31B main model.
  • With the same server setup (RTX 5090 32GB, llama.cpp fork with TurboQuant KV cache, 128K context, parallel=1), the author measures an average throughput increase from 57.17 t/s to 73.73 t/s, reported as +29%.
  • Speedups vary by task, peaking at +50.5% for code generation with an observed accept rate of ~60.7%.
  • The benchmark setup includes specific speculative decoding parameters (--draft-max 8 --draft-min 1) and discards warm-up runs before timing measurements.
  • The results suggest pairing a smaller E2B draft model with a larger 31B model is an effective practical strategy to improve latency/throughput for local LLM deployments.

Following up on my previous Gemma 4 31B benchmark post, I tested speculative decoding with Gemma 4 E2B (4.65B) as the draft model.

The results were much better than I expected, so I wanted to share some controlled benchmark numbers.

Setup

  • GPU: RTX 5090 (32GB VRAM)
  • OS: Windows 11
  • Main model: Gemma 4 31B UD-Q4_K_XL (18.3GB)
  • Draft model: Gemma 4 E2B UD-Q4_K_XL (3.0GB)
  • Backend: llama.cpp fork with TurboQuant KV cache (turbo3)
  • Config: 128K context, parallel=1, Flash Attention, --draft-max 8 --draft-min 1

Benchmark Results

Same server config for both, max_tokens=500, temp=0.7, warm-up query discarded before measuring.


| Query Type | Baseline (t/s) | SpecDec (t/s) | Accept Rate | Speedup |
|---|---|---|---|---|
| Math explanation | 57.45 | 85.86 | 62.9% | +49.5% |
| Korean poetry | 56.93 | 62.34 | 44.1% | +9.5% |
| Code generation | 57.15 | 86.05 | 60.7% | +50.5% |
| Science explanation | 57.19 | 71.14 | 50.9% | +24.4% |
| Translation + analysis | 57.14 | 63.26 | 42.2% | +10.7% |
| **Average** | 57.17 | 73.73 | 52.2% | +29.0% |
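A quick sanity check on the table's averages (numbers copied straight from the rows above):

```python
# Per-task throughput (t/s) from the benchmark table above.
baseline = [57.45, 56.93, 57.15, 57.19, 57.14]
specdec  = [85.86, 62.34, 86.05, 71.14, 63.26]

avg_base = sum(baseline) / len(baseline)
avg_spec = sum(specdec) / len(specdec)

print(f"baseline avg: {avg_base:.2f} t/s")              # 57.17
print(f"specdec avg:  {avg_spec:.2f} t/s")              # 73.73
print(f"speedup:      {avg_spec / avg_base - 1:+.1%}")  # +29.0%
```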

Even at a 42% acceptance rate, speculative decoding is still about 10% faster, because when the vocabs are compatible there is zero token-translation overhead.

The GGUF Version Trap

I initially got terrible results — the draft model was slower than no draft at all (7.31 t/s vs 57 t/s baseline). Every draft model combo gave this warning:

the target and draft vocabs are not compatible - tokens will be translated between the two 

After digging into speculative.cpp, I found the compatibility check compares add_bos_token between target and draft. My 31B GGUF was from early April when Gemma 4 first dropped, and it had add_bos_token = false. The E2B model (downloaded later) had add_bos_token = true. This single metadata mismatch forced llama.cpp into token translation mode, killing all performance gains.
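If you want to check your own files before re-downloading: the flag lives in the GGUF metadata under the key `tokenizer.ggml.add_bos_token`. Below is a minimal pure-Python reader following the public GGUF spec (v2/v3 headers assumed; `gguf_metadata` and the helpers are my own names, not a llama.cpp or `gguf` package API):

```python
import struct

# GGUF scalar value types -> little-endian struct formats.
_SCALAR = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
           6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}

def _read_str(buf, off):
    # GGUF string: u64 length followed by UTF-8 bytes.
    (n,) = struct.unpack_from("<Q", buf, off)
    off += 8
    return bytes(buf[off:off + n]).decode("utf-8"), off + n

def _read_val(buf, off, vtype):
    if vtype in _SCALAR:
        fmt = _SCALAR[vtype]
        (v,) = struct.unpack_from(fmt, buf, off)
        return v, off + struct.calcsize(fmt)
    if vtype == 8:   # string
        return _read_str(buf, off)
    if vtype == 9:   # array: element type (u32), count (u64), then elements
        etype, count = struct.unpack_from("<IQ", buf, off)
        off += 12
        out = []
        for _ in range(count):
            v, off = _read_val(buf, off, etype)
            out.append(v)
        return out, off
    raise ValueError(f"unknown GGUF value type {vtype}")

def gguf_metadata(buf):
    """Parse the key/value metadata block at the start of a GGUF file."""
    assert bytes(buf[:4]) == b"GGUF", "not a GGUF file"
    _n_tensors, n_kv = struct.unpack_from("<QQ", buf, 8)  # header is 24 bytes
    off, meta = 24, {}
    for _ in range(n_kv):
        key, off = _read_str(buf, off)
        (vtype,) = struct.unpack_from("<I", buf, off)
        val, off = _read_val(buf, off + 4, vtype)
        meta[key] = val
    return meta

# Demo on a synthetic two-entry header; a real check would mmap the .gguf file
# and compare meta.get("tokenizer.ggml.add_bos_token") across both models.
def _s(x):  # encode a GGUF string
    return struct.pack("<Q", len(x)) + x.encode()

hdr = (b"GGUF" + struct.pack("<I", 3) + struct.pack("<QQ", 0, 2)
       + _s("general.architecture") + struct.pack("<I", 8) + _s("gemma")
       + _s("tokenizer.ggml.add_bos_token") + struct.pack("<I", 7)
       + struct.pack("<?", True))
meta = gguf_metadata(hdr)
print(meta)  # {'general.architecture': 'gemma', 'tokenizer.ggml.add_bos_token': True}
```

If the value differs between the target and draft GGUFs, you'll hit the token-translation path described above.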

Re-downloading the 31B GGUF (Unsloth re-quantized all Gemma 4 GGUFs recently with the fix) made the warning disappear and unlocked the full +29% speedup.

TL;DR: If you downloaded your Gemma 4 GGUF in early April 2026, re-download it. The tokenizer metadata was fixed.

Practical Tips

Add these flags to your existing llama-server command:

-md gemma-4-E2B-it-UD-Q4_K_XL.gguf -ngld 99 --draft-max 8 --draft-min 1 --parallel 1 

Things to watch out for:

  • --parallel 1 is mandatory — with auto (=4), the draft model's KV cache is allocated 4x, eating VRAM and tanking speed to 7 t/s
  • No vision — speculative decoding and multimodal can't be used together
  • Q4 draft is fine — Q8 (4.8GB) doesn't improve speed over Q4 (3.0GB), and Q4 leaves more VRAM headroom
  • Extra VRAM ~2.3GB — total ~23.4GB with 128K context on a 32GB card (256K fits too, ~25.5GB).
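Putting it all together, a full invocation might look like this (a sketch; the model filenames are hypothetical placeholders for the quants listed in Setup, and `-c 131072` is the 128K context):

```shell
llama-server \
  -m  gemma-4-31B-it-UD-Q4_K_XL.gguf \
  -md gemma-4-E2B-it-UD-Q4_K_XL.gguf \
  -ngl 99 -ngld 99 \
  -c 131072 \
  -fa \
  --draft-max 8 --draft-min 1 \
  --parallel 1
# -md / -ngld: draft model and its GPU offload layers
# -fa: Flash Attention (newer builds may expect `-fa on`)
```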

Content-dependent speedup

The gains scale with how predictable the output is:

  • Code / Math (structured, repetitive patterns): ~60% accept rate → +50% speed
  • Explanations (semi-structured): ~50% accept rate → +24%
  • Creative / Translation (less predictable): ~42% accept rate → +10%
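The accept-rate-to-speedup relationship above can be sketched with the standard back-of-envelope model for speculative decoding. This is a rough estimate, not the author's measurement: it assumes every draft token is accepted independently with probability `a` (real text isn't i.i.d., so it over-predicts the measured gains), and `c`, the draft-to-target cost ratio, is my guess for a ~2B-active draft against a 31B target:

```python
def tokens_per_pass(a: float, k: int) -> float:
    """Expected tokens committed per target forward pass with draft length k:
    a run of accepted draft tokens plus the target's own correction token."""
    return (1 - a ** (k + 1)) / (1 - a)

def est_speedup(a: float, k: int, c: float = 0.05) -> float:
    """Throughput relative to plain decoding; each pass costs one target
    forward plus k draft forwards at relative cost c each."""
    return tokens_per_pass(a, k) / (1 + k * c)

for a in (0.42, 0.52, 0.61):  # accept rates observed above
    print(f"accept {a:.0%}: est. x{est_speedup(a, k=8):.2f}")
# accept 42%: est. x1.23
# accept 52%: est. x1.48
# accept 61%: est. x1.81
```

The model reproduces the shape of the results (acceptance rate dominates) even though the absolute numbers come out optimistic.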

Even the worst case is still a net positive; that's the key difference from the incompatible-vocab setup, where even a 65% acceptance rate produced zero gains.

draft-max Sweep

Thanks to u/Odd-Ordinary-5922 for the suggestion. Same benchmark setup, only varying --draft-max:

| draft-max | Math | Poetry | Code | Science | Translation | Avg (t/s) | vs baseline |
|---|---|---|---|---|---|---|---|
| baseline | 57.45 | 56.93 | 57.15 | 57.19 | 57.14 | 57.17 | |
| 2 | 73.43 | 60.49 | 68.69 | 62.46 | 62.42 | 65.50 | +14.6% |
| 4 | 83.31 | 60.88 | 73.12 | 65.29 | 67.98 | 70.12 | +22.6% |
| 8 | 85.86 | 62.34 | 86.05 | 71.14 | 63.26 | 73.73 | +29.0% |
| 16 | 99.35 | 62.58 | 78.74 | 68.39 | 58.31 | 73.47 | +28.5% |

draft-max 8 is the sweet spot for mixed workloads. 16 pushes math to 99 t/s but regresses on creative/translation, ending up about the same average. Creative text stays flat (~62 t/s) regardless of draft-max — the bottleneck there is acceptance rate, not draft length.

submitted by /u/PerceptionGrouchy187