Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Reddit r/LocalLLaMA / 4/12/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post reports controlled benchmarks showing that speculative decoding using Gemma 4 E2B as the draft model can significantly speed up inference for the Gemma 4 31B main model.
  • With the same server setup (RTX 5090 32GB, llama.cpp fork with TurboQuant KV cache, 128K context, parallel=1), the author measures an average throughput increase from 57.17 t/s to 73.73 t/s, reported as +29%.
  • Speedups vary by task, peaking at +50.5% for code generation with an observed accept rate of ~60.7%.
  • The benchmark setup includes specific speculative decoding parameters (--draft-max 8 --draft-min 1) and discards warm-up runs before timing measurements.
  • The results suggest pairing a smaller E2B draft model with a larger 31B model is an effective practical strategy to improve latency/throughput for local LLM deployments.

Following up on my previous Gemma 4 31B benchmark post, I tested speculative decoding with Gemma 4 E2B (4.65B) as the draft model.

The results were much better than I expected, so I wanted to share some controlled benchmark numbers.

Setup

  • GPU: RTX 5090 (32GB VRAM)
  • OS: Windows 11
  • Main model: Gemma 4 31B UD-Q4_K_XL (18.3GB)
  • Draft model: Gemma 4 E2B UD-Q4_K_XL (3.0GB)
  • Backend: llama.cpp fork with TurboQuant KV cache (turbo3)
  • Config: 128K context, parallel=1, Flash Attention, --draft-max 8 --draft-min 1

Benchmark Results

Same server config for both, max_tokens=500, temp=0.7, warm-up query discarded before measuring.


| Query Type | Baseline (t/s) | SpecDec (t/s) | Accept Rate | Speedup |
|---|---|---|---|---|
| Math explanation | 57.45 | 85.86 | 62.9% | +49.5% |
| Korean poetry | 56.93 | 62.34 | 44.1% | +9.5% |
| Code generation | 57.15 | 86.05 | 60.7% | +50.5% |
| Science explanation | 57.19 | 71.14 | 50.9% | +24.4% |
| Translation + analysis | 57.14 | 63.26 | 42.2% | +10.7% |
| **Average** | 57.17 | 73.73 | 52.2% | +29.0% |
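A quick sanity check on the table's averages (numbers copied straight from the rows above):

```python
# Per-task throughput (t/s) from the benchmark table above.
baseline = [57.45, 56.93, 57.15, 57.19, 57.14]
specdec  = [85.86, 62.34, 86.05, 71.14, 63.26]

avg_base = sum(baseline) / len(baseline)
avg_spec = sum(specdec) / len(specdec)

print(f"baseline avg: {avg_base:.2f} t/s")              # 57.17
print(f"specdec avg:  {avg_spec:.2f} t/s")              # 73.73
print(f"speedup:      {avg_spec / avg_base - 1:+.1%}")  # +29.0%
```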

Even at a 42% acceptance rate, speculative decoding is still about 10% faster, because when the vocabs are compatible there is zero token-translation overhead.

The GGUF Version Trap

I initially got terrible results — the draft model was slower than no draft at all (7.31 t/s vs 57 t/s baseline). Every draft model combo gave this warning:

the target and draft vocabs are not compatible - tokens will be translated between the two 

After digging into speculative.cpp, I found the compatibility check compares add_bos_token between target and draft. My 31B GGUF was from early April when Gemma 4 first dropped, and it had add_bos_token = false. The E2B model (downloaded later) had add_bos_token = true. This single metadata mismatch forced llama.cpp into token translation mode, killing all performance gains.
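If you want to check your own files before re-downloading: the flag lives in the GGUF metadata under the key `tokenizer.ggml.add_bos_token`. Below is a minimal pure-Python reader following the public GGUF spec (v2/v3 headers assumed; `gguf_metadata` and the helpers are my own names, not a llama.cpp or `gguf` package API):

```python
import struct

# GGUF scalar value types -> little-endian struct formats.
_SCALAR = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
           6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}

def _read_str(buf, off):
    # GGUF string: u64 length followed by UTF-8 bytes.
    (n,) = struct.unpack_from("<Q", buf, off)
    off += 8
    return bytes(buf[off:off + n]).decode("utf-8"), off + n

def _read_val(buf, off, vtype):
    if vtype in _SCALAR:
        fmt = _SCALAR[vtype]
        (v,) = struct.unpack_from(fmt, buf, off)
        return v, off + struct.calcsize(fmt)
    if vtype == 8:   # string
        return _read_str(buf, off)
    if vtype == 9:   # array: element type (u32), count (u64), then elements
        etype, count = struct.unpack_from("<IQ", buf, off)
        off += 12
        out = []
        for _ in range(count):
            v, off = _read_val(buf, off, etype)
            out.append(v)
        return out, off
    raise ValueError(f"unknown GGUF value type {vtype}")

def gguf_metadata(buf):
    """Parse the key/value metadata block at the start of a GGUF file."""
    assert bytes(buf[:4]) == b"GGUF", "not a GGUF file"
    _n_tensors, n_kv = struct.unpack_from("<QQ", buf, 8)  # header is 24 bytes
    off, meta = 24, {}
    for _ in range(n_kv):
        key, off = _read_str(buf, off)
        (vtype,) = struct.unpack_from("<I", buf, off)
        val, off = _read_val(buf, off + 4, vtype)
        meta[key] = val
    return meta

# Demo on a synthetic two-entry header; a real check would mmap the .gguf file
# and compare meta.get("tokenizer.ggml.add_bos_token") across both models.
def _s(x):  # encode a GGUF string
    return struct.pack("<Q", len(x)) + x.encode()

hdr = (b"GGUF" + struct.pack("<I", 3) + struct.pack("<QQ", 0, 2)
       + _s("general.architecture") + struct.pack("<I", 8) + _s("gemma")
       + _s("tokenizer.ggml.add_bos_token") + struct.pack("<I", 7)
       + struct.pack("<?", True))
meta = gguf_metadata(hdr)
print(meta)  # {'general.architecture': 'gemma', 'tokenizer.ggml.add_bos_token': True}
```

If the value differs between the target and draft GGUFs, you'll hit the token-translation path described above.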

Re-downloading the 31B GGUF (Unsloth re-quantized all Gemma 4 GGUFs recently with the fix) made the warning disappear and unlocked the full +29% speedup.

TL;DR: If you downloaded your Gemma 4 GGUF in early April 2026, re-download it. The tokenizer metadata was fixed.

Practical Tips

Add these flags to your existing llama-server command:

-md gemma-4-E2B-it-UD-Q4_K_XL.gguf -ngld 99 --draft-max 8 --draft-min 1 --parallel 1 

Things to watch out for:

  • --parallel 1 is mandatory — with auto (=4), the draft model's KV cache is allocated 4x, eating VRAM and tanking speed to 7 t/s
  • No vision — speculative decoding and multimodal can't be used together
  • Q4 draft is fine — Q8 (4.8GB) doesn't improve speed over Q4 (3.0GB), and Q4 leaves more VRAM headroom
  • Extra VRAM ~2.3GB — total ~23.4GB with 128K context on a 32GB card (256K fits too, ~25.5GB).
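Putting it all together, a full invocation might look like this (a sketch; the model filenames are hypothetical placeholders for the quants listed in Setup, and `-c 131072` is the 128K context):

```shell
llama-server \
  -m  gemma-4-31B-it-UD-Q4_K_XL.gguf \
  -md gemma-4-E2B-it-UD-Q4_K_XL.gguf \
  -ngl 99 -ngld 99 \
  -c 131072 \
  -fa \
  --draft-max 8 --draft-min 1 \
  --parallel 1
# -md / -ngld: draft model and its GPU offload layers
# -fa: Flash Attention (newer builds may expect `-fa on`)
```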

Content-dependent speedup

The gains scale with how predictable the output is:

  • Code / Math (structured, repetitive patterns): ~60% accept rate → +50% speed
  • Explanations (semi-structured): ~50% accept rate → +24%
  • Creative / Translation (less predictable): ~42% accept rate → +10%
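The accept-rate-to-speedup relationship above can be sketched with the standard back-of-envelope model for speculative decoding. This is a rough estimate, not the author's measurement: it assumes every draft token is accepted independently with probability `a` (real text isn't i.i.d., so it over-predicts the measured gains), and `c`, the draft-to-target cost ratio, is my guess for a ~2B-active draft against a 31B target:

```python
def tokens_per_pass(a: float, k: int) -> float:
    """Expected tokens committed per target forward pass with draft length k:
    a run of accepted draft tokens plus the target's own correction token."""
    return (1 - a ** (k + 1)) / (1 - a)

def est_speedup(a: float, k: int, c: float = 0.05) -> float:
    """Throughput relative to plain decoding; each pass costs one target
    forward plus k draft forwards at relative cost c each."""
    return tokens_per_pass(a, k) / (1 + k * c)

for a in (0.42, 0.52, 0.61):  # accept rates observed above
    print(f"accept {a:.0%}: est. x{est_speedup(a, k=8):.2f}")
# accept 42%: est. x1.23
# accept 52%: est. x1.48
# accept 61%: est. x1.81
```

The model reproduces the shape of the results (acceptance rate dominates) even though the absolute numbers come out optimistic.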

Even the worst case is still a net positive; that's the key difference from the incompatible-vocab setup, where even a 65% acceptance rate produced zero gains.

draft-max Sweep

Thanks to u/Odd-Ordinary-5922 for the suggestion. Same benchmark setup, only varying --draft-max:

| draft-max | Math | Poetry | Code | Science | Translation | Avg (t/s) | vs baseline |
|---|---|---|---|---|---|---|---|
| baseline | 57.45 | 56.93 | 57.15 | 57.19 | 57.14 | 57.17 | |
| 2 | 73.43 | 60.49 | 68.69 | 62.46 | 62.42 | 65.50 | +14.6% |
| 4 | 83.31 | 60.88 | 73.12 | 65.29 | 67.98 | 70.12 | +22.6% |
| 8 | 85.86 | 62.34 | 86.05 | 71.14 | 63.26 | 73.73 | +29.0% |
| 16 | 99.35 | 62.58 | 78.74 | 68.39 | 58.31 | 73.47 | +28.5% |

draft-max 8 is the sweet spot for mixed workloads. 16 pushes math to 99 t/s but regresses on creative/translation, ending up about the same average. Creative text stays flat (~62 t/s) regardless of draft-max — the bottleneck there is acceptance rate, not draft length.

submitted by /u/PerceptionGrouchy187