RTX 5090 gemma4-26b TG performance report

Reddit r/LocalLLaMA / 4/6/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A Reddit user reports early local testing of a Gemma 4 26B model on an RTX 5090 using a modified vLLM build with NVFP4 support and full-context inference.
  • The model weights occupy about 15.76 GiB, with the rest of GPU memory used for the KV cache.
  • For a storytelling prompt with raw output and no “thinking,” they observe a token-generation (TG) rate of roughly 150 tokens/second.
  • Streaming mode shows a time-to-first-token (TTFT) of about 80 ms, with the user noting that output quality is good.

Nothing exhaustive... but I thought I'd report what I've seen from early testing.

I'm running a modified version of vLLM that has NVFP4 support for gemma4-26b. Weights come in around 15.76 GiB and the remainder is KV cache. I'm running full context as well.
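As a back-of-envelope check on that memory split, the standard per-token KV-cache formula fits in a few lines. The model dimensions below are placeholders, not the actual Gemma 4 26B config (the post doesn't give it); the 32 GiB VRAM figure is the RTX 5090's spec, and activation/runtime overhead is ignored.

```python
GIB = 1024 ** 3

def kv_cache_tokens(free_bytes: int, n_layers: int, n_kv_heads: int,
                    head_dim: int, dtype_bytes: int) -> int:
    """Tokens of KV cache that fit in free_bytes.

    Per token, each layer stores one K and one V vector of size
    n_kv_heads * head_dim, at dtype_bytes per element.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return free_bytes // per_token

# 32 GiB card, ~15.76 GiB of weights -> roughly 16 GiB left for KV cache.
free = int((32 - 15.76) * GIB)

# Placeholder config: 48 layers, 8 KV heads, head_dim 128, fp16 cache.
print(kv_cache_tokens(free, 48, 8, 128, 2))
```

Swap in the real values from the model's config (and 1 byte per element if the KV cache is quantized to FP8) to see how many tokens of context the leftover memory actually covers.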

For a "storytelling" prompt and raw output with no thinking, I'm seeing about 150 t/s on TG.
TTFT in streaming mode is about 80 ms.
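Both figures fall out of three timestamps on a streamed response. A minimal sketch of the arithmetic (the timestamps here are illustrative, chosen to match the reported numbers):

```python
def decode_metrics(t_start: float, t_first: float, t_done: float,
                   n_tokens: int) -> tuple[float, float]:
    """Return (TTFT in ms, TG rate in tokens/s) for a streamed reply.

    TTFT runs from request submission to the first streamed token; the
    TG rate excludes that first token so it measures pure decode speed,
    not prefill.
    """
    ttft_ms = (t_first - t_start) * 1000.0
    tg_tps = (n_tokens - 1) / (t_done - t_first)
    return ttft_ms, tg_tps

# First token 80 ms after submission, then 300 more tokens over the
# next 2 seconds -> 150 t/s decode.
ttft, tps = decode_metrics(0.0, 0.08, 2.08, 301)
print(f"TTFT {ttft:.0f} ms, TG {tps:.0f} t/s")  # TTFT 80 ms, TG 150 t/s
```

In practice you would capture the timestamps with `time.perf_counter()` just before sending the request, on the first streamed chunk, and after the final one.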

Quality is good!

submitted by /u/Nice_Cellist_7595