Pushed Qwen 3.5 27B (the dense one, not MoE) to 1,103,941 tok/s on 12 nodes with 96 B200 GPUs using vLLM.
Going from 9,500 to ~95K tok/s per node came from four changes: switching from TP=8 to DP=8, cutting the max context length from 131K to 4K, FP8 KV cache, and MTP-1 speculative decoding. That last one was the biggest win -- without MTP the GPUs sat essentially idle during decode.
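For reference, here's a minimal sketch of roughly what that per-node config looks like through vLLM's Python engine args. The model ID is a placeholder and the exact speculative-config spelling for this model's MTP head is an assumption on my part -- check the vLLM docs for your version.

```python
# Rough sketch of the per-node serving config described above, via vLLM's Python API.
# Flag spellings follow recent vLLM releases; the MTP speculative config and the
# model ID are assumptions, not the exact setup from this run.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-27B",       # placeholder model ID
    tensor_parallel_size=1,         # was TP=8; shard with DP instead
    data_parallel_size=8,           # one engine replica per GPU on the node
    max_model_len=4096,             # down from 131K; frees KV-cache memory for batching
    kv_cache_dtype="fp8",           # roughly halves KV-cache footprint vs. BF16
    speculative_config={            # MTP-1: one extra predicted token per step
        "method": "mtp",
        "num_speculative_tokens": 1,
    },
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```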
Scaling: 97.1% efficiency at 8 nodes, 96.5% at 12. Load balancing was plain Kubernetes ClusterIP round-robin; the Inference Gateway with KV-cache-aware routing added ~35% overhead, so we didn't use it.
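Those efficiency numbers line up with the per-node figure above. A quick back-of-the-envelope, assuming ~95K tok/s is the single-node baseline:

```python
# Back-of-the-envelope scaling-efficiency check, assuming the ~95K tok/s
# per-node figure above is the single-node baseline.
per_node_baseline = 95_000        # tok/s on one node (assumption)
aggregate_12_nodes = 1_103_941    # tok/s measured across 12 nodes

ideal = 12 * per_node_baseline            # perfect linear scaling
efficiency = aggregate_12_nodes / ideal
print(f"{efficiency:.1%}")                # ~96.8%, close to the 96.5% reported
```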
No custom kernels; vLLM v0.18.0 out of the box. GDN kernel optimizations are still landing upstream.
Disclosure: I work for Google Cloud.