I've been optimizing a 2-GPU inference server for the past week and wanted to share the results. Full data is public with raw JSONs, launch commands, and methodology.
**Hardware:**
- 2x RTX PRO 6000 Blackwell (96GB GDDR7 each)
- EPYC 4564P
- 128GB DDR5 ECC
- c-payne PM50100 Gen5 PCIe switch
- ASRock Rack B650D4U server board
**Results (C=1, single-user decode, tok/s):**
| Model | tok/s | Engine | Config |
|---|---|---|---|
| Qwen3.5-122B NVFP4 | 198 | SGLang b12x+NEXTN | modelopt_fp4, speculative decode |
| Qwen3.5-27B FP8 | 170 | vLLM DFlash | 2B drafter, 2 GPU |
| MiniMax M2.5 NVFP4 | 148 | vLLM b12x Docker | modelopt_fp4 |
| Qwen3.5-122B NVFP4 | 131 | vLLM MTP=1 | compressed-tensors |
| Qwen3.5-397B GGUF | 79 | llama.cpp | UD-Q3_K_XL, fully in VRAM |
**Before you ask:**
*"198 tok/s on 122B? No way."*
3-run verified: 197, 200, 198. Also confirmed with curl: 2000 tokens in 12.7s. Raw JSONs linked below.
*"That's just ctx=0 cherry-picking."*
Tested context scaling today at C=1 (TTFT by prompt length): 4K=1.8s, 16K=2.3s, 57K=7.1s, 150K=23.3s. No crashes at any length. Decode stays ~198 tok/s regardless of context — TTFT grows with prompt length, decode speed doesn't.
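This is also why a naive curl timing understates decode speed: the wall clock includes prefill. A minimal sketch (the helper names and the 2.6s TTFT value are my illustrative assumptions, not figures from the logs):

```python
# Sketch: wall-clock tok/s includes TTFT (prefill); decode tok/s excludes it.
# The 2.6s TTFT below is an assumed illustrative value, not a measured one.

def wall_clock_tok_s(completion_tokens: int, total_s: float) -> float:
    """Throughput over the whole request, prefill included."""
    return completion_tokens / total_s

def decode_tok_s(completion_tokens: int, total_s: float, ttft_s: float) -> float:
    """Throughput over the decode phase only."""
    return completion_tokens / (total_s - ttft_s)

print(round(wall_clock_tok_s(2000, 12.7), 1))   # 157.5 — what a raw curl timing shows
print(round(decode_tok_s(2000, 12.7, 2.6), 1))  # 198.0 — once an assumed 2.6s TTFT is excluded
```

So "2000 tokens in 12.7s" and "~198 tok/s decode" are consistent once TTFT is accounted for.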
*"85% VRAM utilization leaves no headroom."*
VRAM breakdown per GPU from server logs: weights 39.75GB + KV cache 13.9GB + Mamba state 26.4GB + 13.5GB free. KV budget is 2.4M tokens — model only supports 131K max context. Headroom is fine.
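The arithmetic checks out against the 96GB cards — a quick sketch using only the figures above:

```python
# Per-GPU VRAM breakdown from the server logs, in GB (96 GB card).
weights, kv_cache, mamba_state, free = 39.75, 13.9, 26.4, 13.5
used = weights + kv_cache + mamba_state
print(f"used: {used:.2f} GB, utilization: {used / 96 * 100:.1f}%")

# KV budget vs. the model's maximum context window:
print(f"KV headroom: {2_400_000 / 131_072:.0f}x the 131K context window")
```

80.05GB used works out to ~83% utilization, and the KV budget covers the 131K window roughly 18 times over.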
*"Why not just buy a Threadripper?"*
I have one too. This build is 18% faster (198 vs 168 tok/s) because the PCIe switch routes P2P traffic through switch silicon at sub-microsecond latency instead of through the CPU root complex. For MoE TP decode, every forward pass blocks on dozens of small allreduces. At batch 1 the messages are tiny — one token's activations, even with ~10B params active — so bandwidth doesn't matter; latency per sync does. PIX topology wins on latency, not bandwidth.
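A back-of-envelope sketch of that latency argument. Every number here is an assumption for illustration — hidden size, layer count, sync count, and per-sync latencies are not measurements from this build:

```python
# Why per-sync latency dominates TP decode at batch 1: the allreduce payload
# is one token's activations, so it's KB-sized no matter how big the model is.
hidden, bytes_per_elem = 4096, 2          # assumed hidden size, bf16 activations
msg_kb = hidden * bytes_per_elem / 1024   # 8 KB per allreduce at batch 1
layers, syncs_per_layer = 60, 2           # assumed: attention + MLP allreduce per layer

for lat_us in (2, 10):                    # assumed switch-hop vs root-complex-path latency
    comm_ms = layers * syncs_per_layer * lat_us / 1000
    print(f"{lat_us} us/sync: {comm_ms:.2f} ms comm per token ({msg_kb:.0f} KB messages)")

print(f"per-token budget at 198 tok/s: {1000 / 198:.2f} ms")
```

With a ~5ms per-token budget at 198 tok/s, shaving each sync from 10µs to 2µs removes about 1ms of pure stall per token — which is where a latency-optimized topology shows up in the numbers.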
**The secret sauce:**
- PCIe switch (PIX topology) — GPU-to-GPU through switch fabric, not the CPU
- SGLang with b12x MoE kernels — 26% faster than FlashInfer CUTLASS
- NEXTN speculative decoding — +65% over no speculation
- PCIe oneshot allreduce + fusion — optimized multi-GPU communication
- modelopt_fp4 checkpoint (txn545) — required for b12x kernels; compressed-tensors checkpoints don't work with b12x
- Performance governor + pci=noacs + uvm_disable_hmm=1 — without these, P2P hangs and the GPUs wedge
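For anyone reproducing the host tuning, here's a sketch of how those last three flags are typically applied. The exact file paths and modprobe config name are my assumptions — check the repo's methodology for the setup actually used:

```shell
# CPU performance governor (cpupower ships with linux-tools):
sudo cpupower frequency-set -g performance

# Kernel cmdline: add pci=noacs to GRUB_CMDLINE_LINUX in /etc/default/grub,
#   GRUB_CMDLINE_LINUX="... pci=noacs"
# then regenerate the config and reboot:
sudo update-grub

# NVIDIA UVM module option (takes effect after module reload or reboot):
echo "options nvidia-uvm uvm_disable_hmm=1" | sudo tee /etc/modprobe.d/nvidia-uvm.conf
```

pci=noacs disables ACS-based isolation so P2P traffic isn't forced up through the root complex for access control checks.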
**All data is public:**
- Results & methodology:
[github.com/Visual-Synthesizer/rtx6kpro/benchmarks/results.md](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/results.md)
- Raw benchmark JSONs:
[github.com/Visual-Synthesizer/rtx6kpro/benchmarks/inference-throughput](https://github.com/Visual-Synthesizer/rtx6kpro/tree/master/benchmarks/inference-throughput)
- 3-run verification data:
Happy to answer questions. If you think the numbers are wrong, the launch commands are in the repo — reproduce it yourself.


