Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results

Reddit r/LocalLLaMA / 4/10/2026


Key Points

  • The post documents a week-long benchmark of a 2-GPU inference server built around 2x RTX PRO 6000 Blackwell, reporting verified single-user decode throughput for several models including Qwen3.5-122B at ~198 tok/s with SGLang (b12x+NEXTN) and FP4 quantization.
  • The author provides end-to-end test details (raw JSONs, launch commands, and methodology) and reports that decode throughput stays roughly constant across context lengths from 4K up to 150K, while TTFT increases with longer contexts.
  • VRAM accounting suggests sufficient headroom for KV cache given the 131K supported max context (KV budget ~2.4M tokens), countering concerns that high VRAM utilization would prevent stable performance.
  • Performance is attributed to a “secret sauce” centered on a PCIe switch (PIX topology) for low-latency GPU-to-GPU communication, plus specific inference optimizations such as SGLang’s b12x MoE kernels, NEXTN speculative decoding, and optimized oneshot allreduce/fusion.
  • Operational stability required particular system and performance settings (e.g., pci=noacs and uvm_disable_hmm=1), without which the author reports multi-GPU P2P could hang and wedge.

I've been optimizing a 2-GPU inference server for the past week and wanted to share the results. Full data is public with raw JSONs, launch commands, and methodology.

**Hardware:**

- 2x RTX PRO 6000 Blackwell (96GB GDDR7 each)

- EPYC 4564P

- 128GB DDR5 ECC

- c-payne PM50100 Gen5 PCIe switch

- ASRock Rack B650D4U server board

**Results (C=1, single-user decode, tok/s):**

| Model | tok/s | Engine | Config |
|---|---|---|---|
| Qwen3.5-122B NVFP4 | 198 | SGLang b12x+NEXTN | modelopt_fp4, speculative decode |
| Qwen3.5-27B FP8 | 170 | vLLM DFlash | 2B drafter, 2 GPU |
| MiniMax M2.5 NVFP4 | 148 | vLLM b12x Docker | modelopt_fp4 |
| Qwen3.5-122B NVFP4 | 131 | vLLM MTP=1 | compressed-tensors |
| Qwen3.5-397B GGUF | 79 | llama.cpp | UD-Q3_K_XL, fully in VRAM |

**Before you ask:**

*"198 tok/s on 122B? No way."*

3-run verified: 197, 200, 198. Also confirmed with curl: 2000 tokens in 12.7s. Raw JSONs linked below.
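Note that the curl wall-clock includes TTFT, so the end-to-end rate (2000 / 12.7 ≈ 157 tok/s) is lower than the decode-only figure; subtracting TTFT recovers it. A quick arithmetic sketch (not the author's script; the ~2.6 s TTFT is a hypothetical value chosen to show the numbers are mutually consistent):

```python
def decode_tok_s(n_tokens: int, wall_s: float, ttft_s: float) -> float:
    """Decode-only throughput: generated tokens over time spent decoding."""
    return n_tokens / (wall_s - ttft_s)

# Author's curl check: 2000 tokens in 12.7 s end-to-end.
# With a hypothetical ~2.6 s TTFT, decode-only works out to ~198 tok/s,
# matching the benchmark table.
print(round(decode_tok_s(2000, 12.7, 2.6)))  # -> 198
```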

*"That's just ctx=0 cherry-picking."*

Tested context scaling today at C=1. TTFT: 4K = 1.8s, 16K = 2.3s, 57K = 7.1s, 150K = 23.3s. No crashes at any length. Decode stays ~198 tok/s regardless of context; TTFT grows with context, decode doesn't.
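Back-of-envelope on those TTFT numbers (my arithmetic, not from the post): dividing prompt length by TTFT gives an implied prefill rate, which flattens out once fixed per-request overhead stops dominating.

```python
# (context_tokens, ttft_seconds) pairs from the post
points = [(4_000, 1.8), (16_000, 2.3), (57_000, 7.1), (150_000, 23.3)]

for ctx, ttft in points:
    # Implied prefill throughput; small contexts understate it because
    # TTFT also includes fixed scheduling/launch overhead.
    print(f"{ctx:>7} tokens: {ctx / ttft:,.0f} tok/s prefill")
```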

*"85% VRAM utilization leaves no headroom."*

VRAM breakdown per GPU from server logs: weights 39.75GB + KV cache 13.9GB + Mamba state 26.4GB + 13.5GB free ≈ 93.6GB of 96GB. KV budget is 2.4M tokens; the model only supports 131K max context. Headroom is fine.
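The per-GPU accounting checks out; a quick sketch using the numbers above:

```python
weights, kv_cache, mamba_state, free = 39.75, 13.9, 26.4, 13.5  # GB per GPU
total = weights + kv_cache + mamba_state + free
print(f"accounted: {total:.2f} GB of 96 GB")  # remainder is CUDA/runtime reserve

# KV budget vs. max context: even at the full 131K window,
# a 2.4M-token budget covers many concurrent full-length sequences.
kv_budget_tokens, max_ctx = 2_400_000, 131_072
print(f"full-context sequences that fit: {kv_budget_tokens / max_ctx:.1f}")
```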

*"Why not just buy a Threadripper?"*

I have one too. This build is 18% faster (198 vs 168 tok/s) because the PCIe switch routes P2P through silicon at sub-microsecond latency instead of through the CPU root complex. For MoE TP decode, every forward pass blocks on dozens of small allreduces. The messages are tiny (10B active params), so bandwidth doesn't matter. Latency per sync does. PIX topology wins on latency, not bandwidth.
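To see why sync latency dominates, here's a rough budget model. The sync count and per-sync latencies below are hypothetical illustrations, not measurements from this box:

```python
tok_per_s = 198
budget_us = 1e6 / tok_per_s  # ~5050 µs available per decoded token

# Hypothetical: ~2 allreduces per transformer layer, ~90 layers.
syncs_per_token = 2 * 90

for sync_us in (3, 10, 25):  # assumed per-allreduce latency, µs
    comm_us = syncs_per_token * sync_us
    print(f"{sync_us:>3} µs/sync -> {comm_us:>5.0f} µs comm "
          f"({100 * comm_us / budget_us:.0f}% of token budget)")
```

Under these assumptions, shaving even a few microseconds per sync moves the needle far more than extra link bandwidth, which is the author's PIX-topology argument in a nutshell.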

**The secret sauce:**

  1. PCIe switch (PIX topology) — GPU-to-GPU through switch fabric, not CPU

  2. SGLang with b12x MoE kernels — 26% faster than FlashInfer CUTLASS

  3. NEXTN speculative decoding — +65% over no speculation

  4. PCIe oneshot allreduce + fusion — optimized multi-GPU communication

  5. modelopt_fp4 checkpoint (txn545) — required for b12x kernels. compressed-tensors checkpoints don't work with b12x

  6. Performance governor + pci=noacs + uvm_disable_hmm=1 — without these, P2P hangs and GPUs wedge
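A minimal preflight sketch for item 6 (a hypothetical helper, not from the repo): it just checks the kernel cmdline and the nvidia_uvm module parameters for the two settings.

```python
def missing_p2p_prereqs(cmdline: str, uvm_params: dict) -> list:
    """Return which of the P2P-stability settings are absent.

    cmdline:    contents of /proc/cmdline
    uvm_params: nvidia_uvm module params, e.g. read from
                /sys/module/nvidia_uvm/parameters/
    """
    missing = []
    if "pci=noacs" not in cmdline.split():
        missing.append("pci=noacs")
    if str(uvm_params.get("uvm_disable_hmm", "0")) not in ("1", "Y"):
        missing.append("uvm_disable_hmm=1")
    return missing

# Example with a synthetic cmdline:
print(missing_p2p_prereqs("quiet pci=noacs", {"uvm_disable_hmm": "1"}))  # -> []
```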

**All data is public:**

- Results & methodology: [github.com/Visual-Synthesizer/rtx6kpro/benchmarks/results.md](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/results.md)

- Raw benchmark JSONs: [github.com/Visual-Synthesizer/rtx6kpro/benchmarks/inference-throughput](https://github.com/Visual-Synthesizer/rtx6kpro/tree/master/benchmarks/inference-throughput)

- 3-run verification data: [run1](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/inference-throughput/sglang_122b_verify_run1.json), [run2](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/inference-throughput/sglang_122b_verify_run2.json), [run3](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/inference-throughput/sglang_122b_verify_run3.json)

Happy to answer questions. If you think the numbers are wrong, the launch commands are in the repo — reproduce it yourself.

submitted by /u/Visual_Synthesizer