Following up on my previous Gemma 4 31B benchmark post, I tested speculative decoding with Gemma 4 E2B (4.65B) as the draft model. The results were much better than I expected, so I wanted to share some controlled benchmark numbers.

Setup
Benchmark Results

Same server config for both, max_tokens=500, temp=0.7, warm-up query discarded before measuring.
Even at a 42% acceptance rate, speculative decoding is still +10% faster, because there's zero token-translation overhead when the vocabs are compatible.

The GGUF Version Trap

I initially got terrible results: the draft model was slower than no draft at all (7.31 t/s vs. the 57 t/s baseline). Every draft-model combo gave this warning:

After digging in, re-downloading the 31B GGUF (Unsloth re-quantized all Gemma 4 GGUFs recently with the fix) made the warning disappear and unlocked the full +29% speedup.

TL;DR: If you downloaded your Gemma 4 GGUF in early April 2026, re-download it. The tokenizer metadata was fixed.

Practical Tips

Add these flags to your existing llama-server command:

Things to watch out for:
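The exact flag list didn't survive in this copy of the post, but based on llama.cpp's llama-server options (-m, -md/--model-draft, --draft-max, --draft-min) and the parameters quoted in the key points, the invocation likely looks something like the sketch below. Model filenames are placeholders, not the author's actual paths.

```shell
# Dry run: build and print the llama-server command instead of executing it.
# Flag names follow llama.cpp's llama-server; filenames are placeholders.
MODEL=./gemma-4-31b-q4_k_m.gguf   # main model (placeholder filename)
DRAFT=./gemma-4-e2b-q8_0.gguf     # draft model (placeholder filename)

# --draft-max 8 --draft-min 1 match the parameters reported in the post;
# -c 131072 matches the 128K context, -ngl 99 offloads all layers to GPU.
CMD="llama-server -m $MODEL -md $DRAFT --draft-max 8 --draft-min 1 -ngl 99 -c 131072"
echo "$CMD"
```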
Content-dependent speedup

The gains scale with how predictable the output is:
Even the worst case is still a net positive, which is the key difference from incompatible vocabs, where even a 65% acceptance rate yielded zero gains.

draft-max Sweep

Thanks to u/Odd-Ordinary-5922 for the suggestion. Same benchmark setup, varying only draft-max:
draft-max 8 is the sweet spot for mixed workloads. 16 pushes math to 99 t/s but regresses on creative/translation, ending up with about the same average. Creative text stays flat (~62 t/s) regardless of draft-max; the bottleneck there is acceptance rate, not draft length.
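This saturation matches the standard back-of-the-envelope model for speculative decoding: with a per-token acceptance probability a (assumed independent per token, which real workloads only approximate) and draft length n, the expected number of tokens committed per verification round is (1 - a^(n+1)) / (1 - a). A quick awk sweep over the post's acceptance rates (0.42 for creative, 0.61 for code; 0.90 is an illustrative high-acceptance value, not a number from the post) shows why raising draft-max past 8 barely helps unless acceptance is very high:

```shell
# Expected tokens committed per verification round for draft length n,
# assuming each drafted token is accepted independently with probability a.
for n in 4 8 16; do
  for a in 0.42 0.61 0.90; do
    awk -v n="$n" -v a="$a" 'BEGIN {
      printf "draft-max=%2d accept=%.2f E[tokens/round]=%.2f\n",
             n, a, (1 - a^(n+1)) / (1 - a)
    }'
  done
done
```

At a = 0.42 the expectation is already within 1% of its limit 1/(1-a) ≈ 1.72 by n = 8, so longer drafts only burn draft-model compute; at a = 0.90, going from 8 to 16 still adds roughly two committed tokens per round, consistent with the math workload benefiting from draft-max 16.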
Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)
Reddit r/LocalLLaMA / 4/12/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- The post reports controlled benchmarks showing that speculative decoding using Gemma 4 E2B as the draft model can significantly speed up inference for the Gemma 4 31B main model.
- With the same server setup (RTX 5090 32GB, llama.cpp fork with TurboQuant KV cache, 128K context, parallel=1), the author measures an average throughput increase from 57.17 t/s to 73.73 t/s, reported as +29%.
- Speedups vary by task, peaking at +50.5% for code generation, with an observed acceptance rate of ~60.7%.
- The benchmark setup includes specific speculative decoding parameters (--draft-max 8 --draft-min 1) and discards warm-up runs before timing measurements.
- The results suggest pairing a smaller E2B draft model with a larger 31B model is an effective practical strategy to improve latency/throughput for local LLM deployments.
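As a sanity check, the headline +29% figure is consistent with the reported average throughputs (numbers taken directly from the post):

```shell
# Speedup implied by the reported averages: 57.17 t/s -> 73.73 t/s.
awk 'BEGIN { printf "+%.0f%%\n", (73.73 / 57.17 - 1) * 100 }'
# prints +29%
```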