Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub

Reddit r/LocalLLaMA / 3/27/2026

📰 News | Developer Stack & Infrastructure | Signals & Early Trends | Tools & Practical Usage | Models & Research

Key Points

  • Qwen 3.5 27B (dense variant) reportedly reaches 1,103,941 tokens/second on 12-node clusters with 96 NVIDIA B200 GPUs using vLLM.
  • The performance jump is attributed to four main configuration/implementation changes: DP=8 over TP=8, reducing context window from 131K to 4K, using FP8 KV cache, and enabling MTP-1 speculative decoding.
  • MTP-1 is highlighted as the biggest lever: without it, GPU utilization reportedly drops to 0%, while with it the system sustains very high utilization.
  • Scaling results show ~97.1% efficiency at 8 nodes and ~96.5% at 12 nodes using ClusterIP round-robin; the KV-cache-aware routing option added ~35% overhead, so the team avoided it.
  • The team says they used vLLM v0.18.0 out of the box with no custom kernels, with additional “GDN” kernel optimizations planned upstream, and they published all configurations on GitHub.
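The "ClusterIP round-robin" mentioned in the key points is Kubernetes' default Service load-balancing (kube-proxy spreading requests across ready pod endpoints) rather than any custom router. A minimal sketch of such a Service, with all names assumed, not taken from the published configs:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference        # name is an assumption
spec:
  type: ClusterIP             # default in-cluster load balancing via kube-proxy
  selector:
    app: vllm                 # label on the serving pods (assumed)
  ports:
    - port: 8000              # vLLM's default HTTP port
      targetPort: 8000
```

Requests to the Service's cluster IP get distributed across backend pods with no awareness of per-pod KV-cache state, which is the trade-off the post describes accepting.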

Pushed Qwen 3.5 27B (the dense one, not MoE) to 1,103,941 tok/s on 12 nodes with 96 B200 GPUs using vLLM.

Per-node throughput went from 9,500 to 95K tok/s via four changes: DP=8 instead of TP=8, context window cut from 131K to 4K, FP8 KV cache, and MTP-1 speculative decoding. That last one was the biggest -- without MTP, GPU utilization was 0%.
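For context, the four changes roughly map onto standard vLLM launch flags. A sketch only, not the post's published config (the model ID and the speculative-decoding syntax in particular are assumptions; the actual configs are on GitHub):

```shell
# Illustrative sketch -- model ID and speculative-config syntax are assumptions.
# --data-parallel-size 8: DP=8 instead of TP=8 (8 independent replicas per node)
# --max-model-len 4096: context window cut from 131K to 4K
# --kv-cache-dtype fp8: FP8 KV cache, roughly halving KV memory per token
# --speculative-config: MTP-1, i.e. one speculative token per step
vllm serve Qwen/Qwen3.5-27B \
  --data-parallel-size 8 \
  --max-model-len 4096 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```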

Scaling: 97.1% efficiency at 8 nodes, 96.5% at 12. ClusterIP round-robin. The Inference Gateway with KV-cache-aware routing added 35% overhead, so we didn't use it.
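Scaling efficiency here is just measured aggregate throughput over ideal linear scaling. A quick check of the claimed numbers (using the rounded ~95K/node figure, which is why the result lands slightly above the post's 96.5%):

```python
def scaling_efficiency(total_tps: float, nodes: int, per_node_tps: float) -> float:
    """Ratio of measured aggregate throughput to ideal linear scaling
    (nodes * single-node throughput)."""
    return total_tps / (nodes * per_node_tps)

# Post's numbers: 1,103,941 tok/s aggregate, 12 nodes, ~95K tok/s per node.
eff = scaling_efficiency(1_103_941, 12, 95_000)
print(f"{eff:.1%}")  # ~97%, consistent with the claimed 96.5% given rounding
```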

No custom kernels, vLLM v0.18.0 out of the box. GDN kernel optimizations still coming upstream.

https://medium.com/google-cloud/1-million-tokens-per-second-qwen-3-5-27b-on-gke-with-b200-gpus-161da5c1b592

disclosure: I work for Google Cloud.

submitted by /u/m4r1k_