Gemma-4-31B NVFP4 inference numbers on 1x RTX Pro 6000

Reddit r/LocalLLaMA / 4/4/2026


Key Points

  • The author benchmarks NVIDIA’s Hugging Face checkpoint `nvidia/Gemma-4-31B-IT-NVFP4` (NVFP4, ~32GB) on a single RTX Pro 6000, with inference measured at steady state using vLLM and Locust.
  • They report that KV-cache memory is a major VRAM driver for this setup, so they reduced KV-cache precision to FP8 to fit and stabilize performance.
  • Generation throughput per user (e.g., ~36–40 tok/s at 1K–8K context for 1 user) degrades as context length increases and as concurrency rises, with sharp drops at very long contexts (e.g., 96K with 4 users).
  • Time to first token (TTFT) also increases significantly with longer prompts and higher concurrency, ranging from ~0.1–0.2s at 1K with 1–4 users up to ~47.7s at 128K for 1 user.
  • The post includes additional capacity testing at 8K context to estimate how many concurrent users can be supported while maintaining interactive latency and throughput.

Ran a quick inference sweep on Gemma 4 31B in NVFP4 (using nvidia/Gemma-4-31B-IT-NVFP4). The NVFP4 checkpoint is 32GB, half the size of Google's BF16 release (63GB), so it's likely a mix of BF16 and FP4 layers roughly equal to FP8 in total size. This model uses a ton of VRAM for KV cache, so I dropped the KV-cache precision to FP8.

All numbers are steady-state averages under sustained load using Locust, and the figures below are per-user metrics to show interactivity. 1K output tokens. Served with vLLM.

Per-User Generation Speed (tok/s)

| Context | 1 User | 2 Users | 3 Users | 4 Users |
|---------|--------|---------|---------|---------|
| 1K      | 40.7   | 36.6    | 36.1    | 35.1    |
| 8K      | 39.9   | 36.5    | 34.8    | 32.7    |
| 32K     | 40.5   | 28.9    | 25.3    | 23.5    |
| 64K     | 44.5   | 27.4    | 26.7    | 14.3    |
| 96K     | 34.4   | 19.5    | 12.5    | 9.5     |
| 128K    | 38.3   | -       | -       | -       |

Time to First Token

| Context | 1 User | 2 Users | 3 Users | 4 Users |
|---------|--------|---------|---------|---------|
| 1K      | 0.1s   | 0.1s    | 0.2s    | 0.2s    |
| 8K      | 1.0s   | 1.4s    | 1.7s    | 2.0s    |
| 32K     | 5.5s   | 8.1s    | 10.0s   | 12.6s   |
| 64K     | 15.3s  | 22.4s   | 27.7s   | 28.7s   |
| 96K     | 29.6s  | 42.3s   | 48.6s   | 56.7s   |
| 128K    | 47.7s  | -       | -       | -       |
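Dividing context length by the single-user TTFT gives an implied prefill rate; a quick derivation from the table above (my arithmetic, not the author's) shows it falls off as context grows:

```python
# Implied single-user prefill throughput from the TTFT table
# (context tokens / time to first token; ignores small sampling overhead).
ttft = {8_192: 1.0, 32_768: 5.5, 65_536: 15.3, 98_304: 29.6, 131_072: 47.7}
for ctx, secs in ttft.items():
    print(f"{ctx // 1024:>3}K context: ~{ctx / secs / 1000:.1f}K tok/s prefill")
```

This prints roughly 8.2K tok/s at 8K context down to ~2.7K tok/s at 128K, which is consistent with attention cost growing with prompt length.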

Additional tests at 8K context to find user capacity

| Concurrent users | 1    | 2    | 3    | 4    | 23   | 25   | 30   | 32   |
|------------------|------|------|------|------|------|------|------|------|
| Decode (tok/s)   | 39.9 | 36.5 | 34.8 | 32.8 | 22.5 | 18.5 | 16.6 | 15.3 |
| TTFT             | 1.0s | 1.4s | 1.7s | 2.0s | 7.7s | 7.4s | 8.9s | 9.3s |
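Per-user decode speed drops with concurrency, but the aggregate throughput (per-user rate times user count, derived from the table above; my arithmetic, not the author's) keeps climbing and then plateaus around ~500 tok/s as batching stops paying off:

```python
# Aggregate decode throughput at 8K context = per-user tok/s * concurrent users.
decode = {1: 39.9, 2: 36.5, 3: 34.8, 4: 32.8, 23: 22.5, 25: 18.5, 30: 16.6, 32: 15.3}
for users, per_user in decode.items():
    print(f"{users:>2} users: {users * per_user:6.1f} tok/s aggregate")
```

So the GPU is far from decode-saturated at 4 users; the practical ceiling is set by where per-user interactivity (decode speed and TTFT) becomes unacceptable.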

Decode speed is in the same ballpark as Qwen3.5 27B FP8 on this GPU, but prefill is much slower. I'll definitely need to enable prefix caching to make long context usable, especially for multiple users.

I'll retest over the next few days if there are noticeable performance improvements. I'm also looking for FP8 checkpoints of the other Gemma models to test; there's no point in testing the BF16 weights on this card.

submitted by /u/jnmi235