[D] - 1M tokens/second serving Qwen 3.5 27B on B200 GPUs, benchmark results and findings

Reddit r/MachineLearning / 3/27/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The article reports successful deployment and benchmarking of Qwen 3.5 27B (dense, FP8) reaching 1.1M total tokens/sec using vLLM v0.18.0 across 96 NVIDIA B200 GPUs on GKE.
  • It finds that data parallelism (DP=8) nearly quadrupled throughput compared with tensor parallelism (TP=8), concluding that the model is too small for TP to provide a benefit on B200s.
  • Results show strong multi-node scaling efficiency (97.1% at 8 nodes and 96.5% at 12) with TPOT remaining roughly constant (~46ms) as nodes increase.
  • The author highlights that enabling MTP-1 (multi-token prediction speculative decoding) was critical for GPU utilization (0% without it), while MTP-5 caused a cudaErrorIllegalAddress crash.
  • It notes that KV-cache-aware routing via Google’s Inference Gateway adds ~35% overhead versus round-robin ClusterIP, and that a single EPP pod becomes the throughput bottleneck under the tested worst-case workload (no prefix cache hits).

Wrote up the process of pushing Qwen 3.5 27B (dense, FP8) to 1.1M total tok/s on 96 B200 GPUs with vLLM v0.18.0.

  • DP=8 nearly 4x'd throughput over TP=8. Model is too small for tensor parallelism to help on B200s.
  • MTP-1 mattered more than anything else (GPU utilization was 0% without it). MTP-5 crashed with cudaErrorIllegalAddress.
  • 97.1% scaling efficiency at 8 nodes, 96.5% at 12. TPOT flat at ~46ms regardless of node count.
  • Inference Gateway (KV-cache-aware routing) added ~35% overhead vs ClusterIP round-robin. Single EPP pod is the bottleneck.
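
In vLLM terms, the DP-vs-TP comparison above roughly maps to launch flags like these (a minimal sketch: the model identifier is illustrative, and the post's actual GKE manifests and full engine arguments are in the linked article):

```shell
# Tensor parallelism: one replica of the model sharded across 8 GPUs.
# Every layer's matmuls are split, so each forward pass pays for
# inter-GPU all-reduce communication.
vllm serve Qwen/Qwen3.5-27B \
  --tensor-parallel-size 8

# Data parallelism: 8 independent replicas, one per GPU.
# A 27B FP8 model fits in a single B200's HBM, so sharding buys
# nothing and DP avoids the all-reduce overhead entirely.
vllm serve Qwen/Qwen3.5-27B \
  --data-parallel-size 8
```

This is the post's core finding restated as a design choice: tensor parallelism only pays off when the model is too large for one GPU, which a 27B-parameter FP8 checkpoint on a 180 GB B200 is not.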

Benchmarked with the InferenceMAX methodology: input-len=1024, output-len=512, 0% prefix-cache hit rate, so these are worst-case numbers.
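
As a sanity check on the scaling claims, the reported totals imply a per-node baseline you can back out yourself (a back-of-envelope sketch; the ~95K tok/s single-node figure is derived here, not stated in the post):

```python
def scaling_efficiency(total_tps: float, nodes: int, single_node_tps: float) -> float:
    """Linear-scaling efficiency: measured cluster throughput vs. ideal (nodes x single node)."""
    return total_tps / (nodes * single_node_tps)

# Reported: ~1.1M total tok/s on 12 nodes (96 B200s) at 96.5% efficiency.
# Implied single-node baseline:
implied_single_node = 1_100_000 / (12 * 0.965)
print(f"{implied_single_node:,.0f} tok/s per node")  # prints "94,991 tok/s per node"
```

At ~95K tok/s per 8-GPU node, near-linear scaling to 12 nodes is consistent with the flat ~46ms TPOT: decode latency stays per-replica, so adding nodes adds throughput without slowing individual requests.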

https://medium.com/google-cloud/1-million-tokens-per-second-qwen-3-5-27b-on-gke-with-b200-gpus-161da5c1b592

disclosure: I work for Google Cloud.

submitted by /u/m4r1k_