I ran a benchmark to see how much DFlash speculative decoding actually helps in vLLM.
Setup:
- GPU: RTX 5090, 32GB VRAM
- vLLM: 0.19.2rc1
- Main model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
- Draft model: z-lab/gemma-4-26B-A4B-it-DFlash
- Workload: random dataset, 256 input tokens, 1024 output tokens
- Concurrency: 1
- Request rate: 1
- Tested num_speculative_tokens from 0 to 15 (rough repro sketch right after this list)
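
If you want to reproduce something close to this setup, here's a minimal sketch using vLLM's offline Python API. It is not my actual benchmark script: the speculative_config keys can differ between vLLM versions, and the prompts, request count, and sampling params here are placeholders, so treat it as a starting point rather than the exact method.

```python
# Rough repro sketch, not the script from the write-up. Assumes vLLM's offline
# Python API; speculative_config keys and kwargs may differ by vLLM version.
import sys
import time

from vllm import LLM, SamplingParams

# Run once per setting (e.g. `python sweep.py 13`) instead of rebuilding the
# engine in a loop, so each config starts from a clean GPU.
num_spec_tokens = int(sys.argv[1])

llm = LLM(
    model="cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",      # main model
    speculative_config={
        "model": "z-lab/gemma-4-26B-A4B-it-DFlash",     # DFlash draft model
        "num_speculative_tokens": num_spec_tokens,      # swept 0-15 in the post
    },
    max_num_batched_tokens=8192,
)

params = SamplingParams(temperature=0.0, max_tokens=1024, ignore_eos=True)

# Placeholder prompts; the actual run used a random dataset, ~256 input tokens.
prompts = ["Write a long, detailed essay about GPUs. " * 40] * 16

latencies, out_tokens = [], 0
for p in prompts:                       # concurrency 1: one request at a time
    t0 = time.perf_counter()
    out = llm.generate([p], params)[0]
    latencies.append(time.perf_counter() - t0)
    out_tokens += len(out.outputs[0].token_ids)

print(f"k={num_spec_tokens}  "
      f"output tok/s={out_tokens / sum(latencies):.1f}  "
      f"mean E2E={1000 * sum(latencies) / len(latencies):.0f} ms")
```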
The short version:
Baseline without DFlash:
- ~228 output tok/s
- ~4455 ms mean E2E latency
Best practical DFlash setting:
- num_speculative_tokens=13
- max_num_batched_tokens=8192
- ~578 output tok/s
- ~1738 ms mean E2E latency
- ~2.56x speedup
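
As a quick sanity check on that speedup figure, using only the numbers reported above:

```python
# Sanity-check the ~2.56x figure against the raw numbers from this post.
baseline_tps, dflash_tps = 228, 578      # output tok/s
baseline_e2e, dflash_e2e = 4455, 1738    # mean E2E latency, ms

print(f"throughput ratio: {dflash_tps / baseline_tps:.2f}x")   # ~2.54x
print(f"latency ratio:    {baseline_e2e / dflash_e2e:.2f}x")    # ~2.56x
```

Both ratios land in the same ballpark; the ~2.56x number is the latency ratio.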
One interesting thing: the fastest average setting was not automatically the best serving setting. num_speculative_tokens=13 with max_num_batched_tokens=4096 had slightly better mean latency, but worse p95. Moving to 8192 gave a cleaner tail.
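
If you sweep settings yourself, it's worth computing the tail next to the mean before picking a "winner". A tiny helper along these lines works; the latency lists below are made-up placeholder values just to show the shape of the comparison, not my measurements.

```python
# Compare mean vs p95 latency for two candidate configs.
# The numbers below are dummy placeholders, not the results reported above.
import statistics

def summarize(name, latencies_ms):
    mean = statistics.fmean(latencies_ms)
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
    p95 = statistics.quantiles(latencies_ms, n=20)[18]
    print(f"{name}: mean={mean:.0f} ms  p95={p95:.0f} ms")

summarize("k=13, batched=4096", [1600, 1620, 1640, 1660, 1680, 2600])  # dummy
summarize("k=13, batched=8192", [1780, 1790, 1800, 1810, 1820, 1900])  # dummy
```

With numbers shaped like these, the first config wins on the mean but loses badly on p95, which is the pattern that pushed me to 8192.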
I made a short video showing the setup, script, benchmark method, graphs, and the final recommended command.
Charts / script / results:
https://medium.com/@ttio2tech_28094/3a7ac4f73e5d
Curious if others are seeing similar optimal speculative-token counts with DFlash, especially on 4090/5090 or different Gemma/Qwen models.