I wanted to figure out which of the newer small and mid-size models are actually worth running on a single H100, so I put 8 of them through a proper vLLM benchmark and recorded what came out.

The setup was simple: one H100 80GB, vLLM 0.19.1, the built-in `vllm bench serve` tool, 100 prompts per run, 128 input tokens and 128 output tokens. I ran each model at four concurrency levels (1, 4, 8, and 16 simultaneous requests) and measured two things:

- Throughput in tokens/second, which tells you how much the GPU can produce overall once requests are flowing.
- Time to first token (TTFT) in milliseconds, which is how long a user waits before anything appears. This is what makes a chat feel snappy or laggy.

The main finding is that the small Gemma expert models absolutely dominated. At 16 concurrent users, Gemma 4 E2B-it pushed 3,180 TPS while Gemma 4 31B dense managed only 226 on the same GPU. That is roughly 14x the throughput from a model one fifteenth the size. The TTFT gap was even wider: 55 ms versus 4.1 seconds, the difference between a product that feels instant and one that feels broken.

FP8 quantization was the second standout. Qwen 3.6 35B MoE in FP8 was 73% faster than BF16, with lower TTFT too. The dense Qwen 27B pair only saw a 27% gain from FP8, closer to what people usually expect. MoE benefits so much more because those models are bottlenecked on moving expert weights through memory, and FP8 cuts that traffic in half. So FP8 is not just a memory saver anymore; for MoE on H100, it is genuinely faster with no real downside in normal use.

The third thing worth knowing is that Gemma 31B dense falls apart under load on a single GPU. It is fine at low concurrency, but past 4 users the latency explodes. If you want to serve a 30B-class model on one H100, go MoE. Treat the dense 31B as a batch model.

For anyone trying to pick a model right now, here are my thoughts:

- Latency-sensitive chat: Gemma 4 E2B-it. Nothing else is close.
- High throughput or batch: Gemma 4 E2B-it, with E4B as a step up if you need more capability.
- Best balance of quality and speed: Qwen 3.6 35B-A3B in FP8. Around 1,200 tok/s at reasonable latency.
- Skip: dense 27B and 31B. Outclassed by their MoE and FP8 cousins on the same hardware.

Disclosure: the complete experimentation setup, evaluation, and analysis were performed end to end by Neo AI Engineer from my initial task prompt, and I then also evaluated the results manually. I'd be happy to hear which SLMs you're currently deploying for latency-sensitive ops.
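For anyone wanting to reproduce a run like this, the setup described above maps onto vLLM's built-in benchmark roughly as follows. This is a sketch, not the author's exact script: the model ID is illustrative, and the flags follow the conventions of vLLM's `vllm bench serve` tool (random dataset, fixed input/output lengths, capped concurrency).

```shell
# Serve the model under test on the H100 (model ID is illustrative).
vllm serve google/gemma-4-e2b-it &

# Wait until the OpenAI-compatible endpoint is up before benchmarking.

# 100 random prompts, 128 input / 128 output tokens,
# repeated at each concurrency level (1, 4, 8, 16).
for c in 1 4 8 16; do
  vllm bench serve \
    --model google/gemma-4-e2b-it \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 128 \
    --num-prompts 100 \
    --max-concurrency "$c"
done
```

Each run prints aggregate throughput plus mean/median/p99 TTFT, which is where numbers like the 3,180 TPS and 55 ms figures would come from.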
Throughput and TTFT comparisons of Qwen 3.6 27B, Qwen 3.6 35B A3B and Gemma 4 models on H100
Reddit r/LocalLLaMA / 4/25/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- The benchmark compares vLLM performance on a single H100 (80GB) for Qwen 3.6 27B, Qwen 3.6 35B A3B, and Gemma 4 models, measuring both throughput (tokens/sec) and TTFT (time to first token).
- Gemma 4 expert variants (Gemma 4 E2B-it) strongly outperform the dense 31B model, achieving about 3,180 TPS at 16 concurrent users versus about 226 TPS for Gemma 4 31B dense on the same GPU.
- The TTFT results show an even larger user-experience gap, with Gemma 4 E2B-it at roughly 55 ms compared with about 4.1 seconds for Gemma 4 31B dense, highlighting major latency sensitivity.
- FP8 quantization is a key optimization: Qwen 3.6 35B MoE in FP8 runs about 73% faster than BF16 with lower TTFT, while dense 27B sees only a ~27% gain, suggesting MoE memory/weight-movement benefits more from FP8 on H100.
- The dense 27B and 31B models degrade under higher concurrency on a single GPU, and the author recommends prioritizing Gemma 4 MoE for latency, while using Qwen 3.6 35B-A3B (FP8) for a quality/speed balance and treating dense 31B as batch-oriented.
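The "MoE gains more from FP8" point follows from simple memory-traffic arithmetic. Below is a rough decode-cost sketch under assumed numbers (H100 SXM HBM3 bandwidth of about 3.35 TB/s, ~3B active parameters for the A3B MoE versus ~27B for the dense model); none of these constants come from the post itself.

```python
# Rough decode-step cost model: each generated token must stream the
# active weights through HBM once, so per-token latency is bounded below
# by (active params x bytes per param) / memory bandwidth.
HBM_BANDWIDTH = 3.35e12  # bytes/s, H100 SXM HBM3 (assumed)

def per_token_ms(active_params: float, bytes_per_param: float) -> float:
    """Lower-bound decode latency per token in the memory-bound regime."""
    return active_params * bytes_per_param / HBM_BANDWIDTH * 1e3

# MoE with ~3B active params: BF16 (2 bytes/param) vs FP8 (1 byte/param)
moe_bf16 = per_token_ms(3e9, 2)
moe_fp8 = per_token_ms(3e9, 1)

# Dense 27B: all parameters are active for every token
dense_bf16 = per_token_ms(27e9, 2)

print(f"MoE BF16:   {moe_bf16:.2f} ms/token")    # ~1.79
print(f"MoE FP8:    {moe_fp8:.2f} ms/token")     # ~0.90
print(f"Dense BF16: {dense_bf16:.2f} ms/token")  # ~16.12
```

This naive model predicts up to 2x from FP8 when weight traffic dominates, so the observed 73% MoE speedup is plausible once compute and KV-cache traffic are added back in. The dense model's smaller 27% gain is also consistent: at batch 16 a dense 27B shifts toward the compute-bound regime, where halving weight traffic helps less.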