I've been optimizing a 2-GPU inference server for the past week and wanted to share the results. Full data is public with raw JSONs, launch commands, and methodology.
**Hardware:**
- 2x RTX PRO 6000 Blackwell (96GB GDDR7 each)
- EPYC 4564P
- 128GB DDR5 ECC
- c-payne PM50100 Gen5 PCIe switch
- ASRock Rack B650D4U server board
**Results (C=1, single-user decode, tok/s):**
| Model | tok/s | Engine | Config |
|---|---|---|---|
| Qwen3.5-122B NVFP4 | 198 | SGLang b12x+NEXTN | modelopt_fp4, speculative decode |
| Qwen3.5-27B FP8 | 170 | vLLM DFlash | 2B drafter, 2 GPU |
| MiniMax M2.5 NVFP4 | 148 | vLLM b12x Docker | modelopt_fp4 |
| Qwen3.5-122B NVFP4 | 131 | vLLM MTP=1 | compressed-tensors |
| Qwen3.5-397B GGUF | 79 | llama.cpp | UD-Q3_K_XL, fully in VRAM |
**Before you ask:**
*"198 tok/s on 122B? No way."*
3-run verified: 197, 200, 198. Also confirmed with curl: 2000 tokens in 12.7s. Raw JSONs linked below.
*"That's just ctx=0 cherry-picking."*
Tested context scaling today at C=1 (TTFT by prompt length): 4K=1.8s, 16K=2.3s, 57K=7.1s, 150K=23.3s. No crashes at any length. Decode stays ~198 tok/s regardless of context — TTFT grows with prompt length, decode speed doesn't.
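This is also why a naive curl timing understates decode speed: the wall clock includes prefill. A minimal sketch (the helper names and the 2.6s TTFT value are my illustrative assumptions, not figures from the logs):

```python
# Sketch: wall-clock tok/s includes TTFT (prefill); decode tok/s excludes it.
# The 2.6s TTFT below is an assumed illustrative value, not a measured one.

def wall_clock_tok_s(completion_tokens: int, total_s: float) -> float:
    """Throughput over the whole request, prefill included."""
    return completion_tokens / total_s

def decode_tok_s(completion_tokens: int, total_s: float, ttft_s: float) -> float:
    """Throughput over the decode phase only."""
    return completion_tokens / (total_s - ttft_s)

print(round(wall_clock_tok_s(2000, 12.7), 1))   # 157.5 — what a raw curl timing shows
print(round(decode_tok_s(2000, 12.7, 2.6), 1))  # 198.0 — once an assumed 2.6s TTFT is excluded
```

So "2000 tokens in 12.7s" and "~198 tok/s decode" are consistent once TTFT is accounted for.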
*"85% VRAM utilization leaves no headroom."*
VRAM breakdown per GPU from server logs: weights 39.75GB + KV cache 13.9GB + Mamba state 26.4GB + 13.5GB free. KV budget is 2.4M tokens — model only supports 131K max context. Headroom is fine.
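The arithmetic checks out against the 96GB cards — a quick sketch using only the figures above:

```python
# Per-GPU VRAM breakdown from the server logs, in GB (96 GB card).
weights, kv_cache, mamba_state, free = 39.75, 13.9, 26.4, 13.5
used = weights + kv_cache + mamba_state
print(f"used: {used:.2f} GB, utilization: {used / 96 * 100:.1f}%")

# KV budget vs. the model's maximum context window:
print(f"KV headroom: {2_400_000 / 131_072:.0f}x the 131K context window")
```

80.05GB used works out to ~83% utilization, and the KV budget covers the 131K window roughly 18 times over.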
*"Why not just buy a Threadripper?"*
I have one too. This build is 18% faster (198 vs 168 tok/s) because the PCIe switch routes P2P traffic through switch silicon at sub-microsecond latency instead of through the CPU root complex. For MoE TP decode, every forward pass blocks on dozens of small allreduces. At batch 1 the messages are tiny — one token's activations, even with ~10B params active — so bandwidth doesn't matter; latency per sync does. PIX topology wins on latency, not bandwidth.
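A back-of-envelope sketch of that latency argument. Every number here is an assumption for illustration — hidden size, layer count, sync count, and per-sync latencies are not measurements from this build:

```python
# Why per-sync latency dominates TP decode at batch 1: the allreduce payload
# is one token's activations, so it's KB-sized no matter how big the model is.
hidden, bytes_per_elem = 4096, 2          # assumed hidden size, bf16 activations
msg_kb = hidden * bytes_per_elem / 1024   # 8 KB per allreduce at batch 1
layers, syncs_per_layer = 60, 2           # assumed: attention + MLP allreduce per layer

for lat_us in (2, 10):                    # assumed switch-hop vs root-complex-path latency
    comm_ms = layers * syncs_per_layer * lat_us / 1000
    print(f"{lat_us} us/sync: {comm_ms:.2f} ms comm per token ({msg_kb:.0f} KB messages)")

print(f"per-token budget at 198 tok/s: {1000 / 198:.2f} ms")
```

With a ~5ms per-token budget at 198 tok/s, shaving each sync from 10µs to 2µs removes about 1ms of pure stall per token — which is where a latency-optimized topology shows up in the numbers.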
**The secret sauce:**
- PCIe switch (PIX topology) — GPU-to-GPU through switch fabric, not the CPU
- SGLang with b12x MoE kernels — 26% faster than FlashInfer CUTLASS
- NEXTN speculative decoding — +65% over no speculation
- PCIe oneshot allreduce + fusion — optimized multi-GPU communication
- modelopt_fp4 checkpoint (txn545) — required for b12x kernels; compressed-tensors checkpoints don't work with b12x
- Performance governor + pci=noacs + uvm_disable_hmm=1 — without these, P2P hangs and the GPUs wedge
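For anyone reproducing the host tuning, here's a sketch of how those last three flags are typically applied. The exact file paths and modprobe config name are my assumptions — check the repo's methodology for the setup actually used:

```shell
# CPU performance governor (cpupower ships with linux-tools):
sudo cpupower frequency-set -g performance

# Kernel cmdline: add pci=noacs to GRUB_CMDLINE_LINUX in /etc/default/grub,
#   GRUB_CMDLINE_LINUX="... pci=noacs"
# then regenerate the config and reboot:
sudo update-grub

# NVIDIA UVM module option (takes effect after module reload or reboot):
echo "options nvidia-uvm uvm_disable_hmm=1" | sudo tee /etc/modprobe.d/nvidia-uvm.conf
```

pci=noacs disables ACS-based isolation so P2P traffic isn't forced up through the root complex for access control checks.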
**All data is public:**
- Results & methodology:
[github.com/Visual-Synthesizer/rtx6kpro/benchmarks/results.md](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/results.md)
- Raw benchmark JSONs:
[github.com/Visual-Synthesizer/rtx6kpro/benchmarks/inference-throughput](https://github.com/Visual-Synthesizer/rtx6kpro/tree/master/benchmarks/inference-throughput)
- 3-run verification data:
Happy to answer questions. If you think the numbers are wrong, the launch commands are in the repo — reproduce it yourself.


