Speculative decoding with Gemma-4-31B + Gemma-4-E2B enables 120–200 tok/s output speed for specific tasks

Reddit r/LocalLLaMA / 4/26/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The author shares a practical setup using speculative decoding with Gemma-4-31B and Gemma-4-E2B to achieve much faster LLM output speeds for specific, non-agentic tasks.
  • In their tests for multilingual (non-English) workflows with relatively small context windows (about 2k–6k tokens per request), quality was reported as better than Gemini 2.5 Flash-lite, including fewer issues like unexpected looping.
  • They configured the models via GGUF quantizations (gemma-4-31B Q6_K_L and gemma-4-E2B Q8_0) and report output throughput of around 130–200 tokens per second while running locally.
  • The approach requires about 31.5 GB of VRAM, which just barely fits on their RTX 5090, and they note the next step is validating performance at larger scale.
  • They conclude that for lightweight extraction/classification tasks requiring structured JSON, running locally can reduce or remove the need for Vertex API (subject to further testing).

So for my project, up until now I was using either Gemini 3 / 2.5 Flash or Flash-lite. None of my use cases are agentic; they are simple LLM workflows for atomic tasks like extracting references from laws, classifying, adjusting titles to the nominative case, and so on. All of this happens in a non-English language (Lithuanian), which is one of the reasons I originally used Google models: their multilingual quality is very good for small base languages.

Each request usually fits within 2k–6k tokens of context.

Recently I found that at least Gemini 2.5 Flash-lite started producing horrible results, and it even started looping, which I had never experienced before. Not sure if that's a coincidence or something changed internally in the Vertex API / their models.

Since I have an RTX 5090, I decided to give Gemma 4 31B a try.

My requirements are quite simple: as good as possible at non-English languages, good at producing structured JSON responses, context up to 8K, and output speed as fast as possible.

So to squeeze out the best possible quality, I tried running gemma-4-31B-it-GGUF:Q6_K_L + gemma-4-E2B-it-GGUF:Q8_0 with speculative decoding.

And well, at least from my initial small-sample testing, I can say that the quality is better than Gemini 2.5 Flash-lite, it is faster, and it runs locally. The output speeds I get are around 130–200 tok/s, which is incredible for the quality I'm getting. The setup uses 31.5 GB of VRAM, which barely fits into my GPU.

My point is that for lightweight LLM workflows such as data extraction and similar tasks I no longer need Vertex API.
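
To make the swap concrete: llama-server exposes an OpenAI-compatible chat-completions endpoint, so the client side only needs to point at localhost instead of Vertex. Below is a minimal sketch of what one of these atomic extraction tasks could look like as a request. The prompt, the JSON field names, and the port (taken from my server command at the end of the post) are illustrative assumptions, and JSON-constrained output via response_format may depend on your llama.cpp build, so verify it before relying on it:

# Hypothetical request: extract law references as structured JSON from the
# local llama-server (OpenAI-compatible endpoint; server command is below).
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system",
       "content": "Extract all legal act references from the user text. Reply with JSON only: {\"references\": [{\"act\": \"...\", \"article\": \"...\"}]}"},
      {"role": "user", "content": "<Lithuanian legal text goes here>"}
    ],
    "response_format": {"type": "json_object"},
    "temperature": 1.0
  }'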

Of course, the second step is to try it at larger scale instead of just a few simple tests (a replay sketch follows the command below).

https://preview.redd.it/m9j3wzb2bjxg1.png?width=856&format=png&auto=webp&s=15e6b2db2649e4d49f5bf04b0b0f618482ae88d8

Just wanted to share for others who might have similar use cases; it is worth a try. Here is my llama command:

./build/bin/llama-server \
  -hf bartowski/google_gemma-4-31B-it-GGUF:Q6_K_L \
  -hfd unsloth/gemma-4-E2B-it-GGUF:Q8_0 \
  -ngl 99 -ngld 99 -fa 1 \
  -c 8192 \
  --draft-max 12 --draft-min 2 \
  --parallel 1 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --reasoning-budget 0 --no-mmproj \
  --host 0.0.0.0 --port 8080 \
  --temp 1.0 --top-p 0.95 --top-k 64
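
For the larger-scale validation mentioned above, a trivial sanity check is to replay a file of saved request bodies against the server and time the run. This is only a sketch: requests.jsonl (one JSON request body per line) is an assumed file name, not part of my setup. Since the server runs with --parallel 1, a sequential loop matches the single serving slot anyway:

# Sketch: replay saved request bodies one per line and time the whole run.
# requests.jsonl is an assumed file for illustration.
time while IFS= read -r body; do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$body" > /dev/null
done < requests.jsonl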
submitted by /u/Clasyc