Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Reddit r/LocalLLaMA / 5/8/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The author benchmarks DFlash speculative decoding in vLLM 0.19.2rc1 using the cyankiwi Gemma-4-26B 4-bit AWQ model on a single RTX 5090 (32GB VRAM).
  • With DFlash disabled, the system reaches about 228 output tokens/sec and roughly 4455 ms mean end-to-end latency for a workload of 256 input tokens and 1024 output tokens.
  • The best practical DFlash configuration tested is num_speculative_tokens=13 and max_num_batched_tokens=8192, improving throughput to ~578 output tokens/sec and reducing mean latency to ~1738 ms (about 2.56x speedup).
  • The fastest average setting is not necessarily the best serving configuration: using max_num_batched_tokens=4096 slightly improves mean latency but worsens p95 tail latency, while 8192 produces a cleaner tail.
  • The post shares a recommended command along with a video and charts/scripts, and asks whether others see similar optimal speculative-token counts on other GPUs or with other Gemma/Qwen models.

I ran a benchmark to see how much DFlash speculative decoding actually helps in vLLM.

Setup:

  • GPU: RTX 5090, 32GB VRAM
  • vLLM: 0.19.2rc1
  • Main model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
  • Draft model: z-lab/gemma-4-26B-A4B-it-DFlash
  • Workload: random dataset, 256 input tokens, 1024 output tokens
  • Concurrency: 1
  • Request rate: 1
  • Tested num_speculative_tokens from 0 to 15 (bench command sketched below)
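
To make the sweep reproducible, here is a hedged sketch of the benchmark client, using vLLM's serving-benchmark flags for the workload above (random dataset, 256 in / 1024 out, concurrency 1, request rate 1). Exact flag spelling can vary across vLLM versions, and the prompt count is my assumption, not from the post:

    # Hedged sketch: vLLM serving benchmark against an already-running server.
    # --num-prompts 100 is an assumed value, not from the post.
    vllm bench serve \
      --model cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
      --dataset-name random \
      --random-input-len 256 \
      --random-output-len 1024 \
      --max-concurrency 1 \
      --request-rate 1 \
      --num-prompts 100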

The short version:

Baseline without DFlash:

  • ~228 output tok/s
  • ~4455 ms mean E2E latency

Best practical DFlash setting (serve-command sketch below):

  • num_speculative_tokens=13
  • max_num_batched_tokens=8192
  • ~578 output tok/s
  • ~1738 ms mean E2E latency
  • ~2.56x speedup
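
For reference, a serve command matching that setting might look like the sketch below. The post's actual recommended command is in the linked video; the --speculative-config JSON shape and the "dflash" method name here are my assumptions, not confirmed by the post:

    # Hedged sketch: serving Gemma 4 26B with the DFlash draft model.
    # "method": "dflash" is an assumed identifier; check the author's
    # video/script for the exact command.
    vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
      --max-num-batched-tokens 8192 \
      --speculative-config '{"method": "dflash", "model": "z-lab/gemma-4-26B-A4B-it-DFlash", "num_speculative_tokens": 13}'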

One interesting thing: the fastest average setting was not automatically the best serving setting. num_speculative_tokens=13 with max_num_batched_tokens=4096 had slightly better mean latency, but worse p95. Moving to 8192 gave a cleaner tail.
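
If you want to check the same thing on your own runs, a quick way to compare the mean against the tail is to compute nearest-rank p95 over per-request latencies. A minimal sketch, assuming one E2E latency in ms per line (latencies_ms.txt is a placeholder name):

    # Mean and nearest-rank p95 from one latency value (ms) per line.
    sort -n latencies_ms.txt | awk '
      { v[NR] = $1; sum += $1 }
      END {
        p95 = v[int((NR * 95 + 99) / 100)]  # index = ceil(0.95 * NR)
        printf "mean=%.1f ms  p95=%.1f ms\n", sum / NR, p95
      }'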

I made a short video showing the setup, script, benchmark method, graphs, and final recommended command:

https://youtu.be/S_zbHH5Ycs0

Charts / script / results:

https://medium.com/@ttio2tech_28094/3a7ac4f73e5d

Curious if others are seeing similar optimal speculative-token counts with DFlash, especially on 4090/5090 or different Gemma/Qwen models.

submitted by /u/chain-77