
I spent 8+ hours benchmarking every MoE backend for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 (SM120). Here's what I found.

Reddit r/LocalLLaMA / 3/12/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The fastest sustained decode observed was 50.5 tok/s on SM120 with Marlin TP=4 and Marlin W4A16, contradicting circulating claims of 130+ tok/s.
  • Enabling MTP or relying on auto backends generally reduces throughput, with Marlin TP=2+PP=2 reaching 49 tok/s and MTP-enabled setups around 39–40 tok/s.
  • Several backends are hampered by bugs: CUTLASS shows a big speed discrepancy (41 tok/s best-case vs 26 tok/s worst-case due to the same bug); vLLM native CUTLASS produces garbage output; Default TP=4 with auto backend also yields garbage outputs; SGLang 0.5.8 returns NaN.
  • A central blocker is an NVIDIA CUTLASS bug on SM120 that prevents FP4 tensor-core MoE workloads from achieving hardware-claimed speeds, blocking many configurations from improving.
  • The study covered 16 configurations across 4x RTX PRO 6000 (Blackwell) machines, testing multiple Docker images, frameworks, MTP/backends, CUDA versions, and kernel patches to map real-world performance.

The short version: 50.5 tok/s sustained decode is the best I can get, and I'm pretty sure it's the best anyone has actually gotten on SM120 hardware -- despite claims of 130+ tok/s floating around. The reason? NVIDIA's own CUTLASS kernels are broken on their own workstation GPU.


The Setup

  • 4x RTX PRO 6000 Blackwell Workstation Edition (96GB GDDR7 each, 384GB total)
  • SM 12.0 -- this is the desktop/workstation Blackwell, NOT the datacenter B200 (SM 10.0)
  • PCIe Gen5, no NVLink
  • Threadripper 24C/48T, 512GB DDR5
  • Windows 11 + WSL2
  • Model: nvidia/Qwen3.5-397B-A17B-NVFP4 (~140GB, 397B total params, 17B active per token)
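
Back-of-envelope on why this fits: under TP=4 the ~140GB of NVFP4 weights shard to ~35GB per card, leaving generous KV-cache headroom at 95% memory utilization. A quick sketch using only the numbers from this post:

```python
# Memory budget per GPU under TP=4 (all figures from the setup above).
GPUS = 4
VRAM_PER_GPU_GB = 96
WEIGHTS_GB = 140
MEM_UTIL = 0.95            # matches --gpu-memory-utilization 0.95

weights_per_gpu = WEIGHTS_GB / GPUS                      # ~35 GB/card
headroom_per_gpu = VRAM_PER_GPU_GB * MEM_UTIL - weights_per_gpu

print(f"Weights per GPU:    {weights_per_gpu:.1f} GB")
print(f"KV-cache headroom:  {headroom_per_gpu:.1f} GB per GPU")
```

That ~56GB of per-card headroom is what makes the 262K context length in the serve command viable.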

16 Configurations Tested

I tested literally everything available: multiple Docker images, two inference frameworks, every MoE backend, MTP on/off, different CUDA versions, EP/PP/TP combinations, and a dozen kernel patches.

| Config | Backend | TP | MTP | tok/s | Verdict |
|---|---|---|---|---|---|
| Marlin TP=4, no MTP | Marlin W4A16 | 4 | No | 50.5 | Winner |
| Marlin TP=2+PP=2 | Marlin W4A16 | 2+PP2 | No | 49 | Close second |
| Marlin + MTP=2 | Marlin W4A16 | 4 | Yes | 39-40 | MTP makes it SLOWER |
| CUTLASS Docker (best case) | FlashInfer CUTLASS | 4 | Yes | 41 | 80 fast kernels skipped |
| CUTLASS Docker (worst case) | FlashInfer CUTLASS | 4 | Yes | 26 | Same bug, worse fallback |
| vLLM native CUTLASS | CUTLASS | 4 | Yes | ~5 | Garbage output |
| Default TP=4 (auto backend) | CUTLASS | 4 | No | 6-7 | Garbage output |
| SGLang 0.5.8 | FlashInfer | 4 | -- | NaN | Literally NaN |
| Expert Parallel | Marlin | 2+EP2 | No | 1.4-2.6 | Don't even try on PCIe |
| TensorRT-LLM | -- | -- | -- | N/A | Doesn't support the arch |
| FlashInfer Sampler | Marlin | 4 | No | 5.9 | 8.6x regression from default |

The NVIDIA Bug That's Blocking Everything

Here's the thing that makes this frustrating: the RTX PRO 6000 has FP4 tensor cores. NVIDIA ships NVFP4-quantized models designed to use them. The CUTLASS library has grouped GEMM kernels that should light them up for MoE inference.

But on SM120, all 80 TMA Warp Specialized grouped GEMM tactics fail at initialization. Every single one. The error:

```
Failed to initialize cutlass TMA WS grouped gemm. Error: Error Internal (cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:60)
```

So instead of native FP4 compute, you're stuck with Marlin, which dequantizes your FP4 weights to FP16 and runs standard GEMM. You're leaving roughly half the theoretical throughput on the table.
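
To make the fallback concrete, here's a minimal NumPy sketch of what a W4A16 path like Marlin does conceptually: weights are stored as 4-bit codes plus per-group scales, dequantized to 16-bit, and then fed to a standard GEMM, so the FP4 tensor cores never fire. The E2M1 value table is the standard FP4 code set; the flat per-group scaling scheme is a simplification for illustration, not Marlin's actual layout.

```python
import numpy as np

# The 16 representable FP4 (E2M1) values, signed.
FP4_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float16,
)

def dequant_w4a16(codes: np.ndarray, scales: np.ndarray, group: int = 16) -> np.ndarray:
    """codes: 4-bit indices, shape (K, N); scales: (K // group, N)."""
    w = FP4_VALUES[codes]                        # look up FP4 values
    w = w.reshape(-1, group, codes.shape[1])     # group along the K dimension
    return (w * scales[:, None, :]).reshape(codes.shape).astype(np.float16)

rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=(32, 8))
scales = rng.random((2, 8), dtype=np.float32).astype(np.float16)
x = rng.standard_normal((4, 32)).astype(np.float16)

w16 = dequant_w4a16(codes, scales)
y = x @ w16        # plain FP16 GEMM -- no FP4 tensor cores involved
print(y.shape)
```

Every matmul pays the dequantization cost and runs at FP16 rates, which is where the "roughly half the theoretical throughput" estimate comes from.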

I filed CUTLASS issue #3096. No response from NVIDIA.

The kicker: SM121 (DGX Spark, the other Blackwell variant) DOES work with NVFP4 MoE at 356 TFLOPS. So SM12x can do it -- NVIDIA just hasn't validated the SM120 tile configs.

Why MTP Makes Things Worse

This surprised me. Multi-Token Prediction should help, right? On SM120 with Marlin, it's a -22% regression:

  • Without MTP: 50.5 tok/s
  • With MTP=2: 39.6 tok/s

The MTP draft heads were trained on native FP4 activations. Marlin uses W4A16 dequantization, which produces slightly different activation values. Result: 61-85% acceptance rate vs the expected 89%. The overhead of speculating and rejecting outweighs the benefit.
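
A rough model makes the crossover visible. Assume chain-style acceptance (one guaranteed token per verify step plus a geometric run of accepted drafts) and a fixed per-step overhead for drafting plus verification. The overhead factor below is an assumption fitted to this post's numbers, not a measurement:

```python
def effective_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    # One guaranteed token per verify step, plus a geometric run of
    # accepted draft tokens (chain-style acceptance model).
    return sum(accept_rate ** i for i in range(draft_len + 1))

base_tok_s = 50.5       # measured non-MTP decode from the post
step_overhead = 2.5     # ASSUMED relative cost of draft + verify per step

for a in (0.61, 0.89):
    speedup = effective_tokens_per_step(a, draft_len=2) / step_overhead
    print(f"acceptance {a:.2f}: ~{base_tok_s * speedup:.1f} tok/s")
```

Under this toy model, the expected 89% acceptance would be a modest win, while the observed 61% lands right around the measured ~40 tok/s: the speculation overhead eats the gain.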

About Those 130 tok/s Claims

Someone on the community forums has been claiming 130-150 tok/s on the same hardware via custom SGLang/vLLM forks. I pulled both repos and reviewed every commit.

Zero kernel-level changes. The forks modify Python-level quantization config, attention registry, and MTP state management. They use the same broken CUTLASS fallback. The same 80 TMA WS tactics fail.

How do you get 130 tok/s from code that runs at 50 tok/s? Most likely explanation: counting speculative tokens (proposed + rejected) rather than actual output tokens delivered. When you measure wall-clock output over 1000+ tokens, 50.5 tok/s is what you get.
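
The arithmetic of that miscount is easy to reproduce. With hypothetical speculative-decoding counts (illustrative numbers, not logs from either fork), a drafter can "process" far more tokens than it delivers, and only delivered tokens over wall-clock time count:

```python
# Illustrative speculative-decoding tallies -- hypothetical, chosen to
# show how the two metrics diverge.
proposed = 2600                    # draft tokens proposed (accepted + rejected)
rejected = 1600
delivered = proposed - rejected    # 1000 tokens of actual output
elapsed = 19.8                     # seconds of wall-clock generation

print(f"real:     {delivered / elapsed:.1f} tok/s")   # -> real:     50.5 tok/s
print(f"inflated: {proposed / elapsed:.1f} tok/s")    # -> inflated: 131.3 tok/s
```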

If someone has genuinely hit 130+ tok/s sustained decode with correct output on SM120, I would love to be proven wrong. Show me a generation log with timestamps.

What It Took to Get Here

Just getting to 50.5 tok/s required 12 patches across FlashInfer and vLLM:

  • 7 FlashInfer patches: SM version checks, compute capability mappings, GDC compile flags, CuTe DSL architecture lists
  • 5 vLLM patches: is_device_capability_family(120) checks in MoE backend selection

Submitted upstream:

  • FlashInfer PR #2725
  • vLLM PR #36453

What This Means Practically

50.5 tok/s for a 397B parameter model is genuinely impressive -- it's faster than most people's Llama 70B setups. The model quality is excellent. For single-user workloads, it's very usable.

But it should be 2-3x faster. NVIDIA sells this as a $20K+ professional AI GPU. They ship NVFP4 models for it. The inference path they designed for it doesn't work on it. That's not a software limitation -- it's a bug in NVIDIA's own kernel library that they haven't acknowledged.

Practical Config for Anyone With This Hardware

```bash
# The important part: force Marlin, disable MTP
export VLLM_MOE_FORCE_MARLIN=1

vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.95 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --kv-cache-dtype fp8_e4m3 \
  --calculate-kv-scales
```

Don't use --enforce-eager (CUDA graphs help). Don't enable MTP. Don't try expert parallel on PCIe.
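
If you want to verify your own numbers (or produce the timestamped log I asked for above), here's a minimal wall-clock benchmark against vLLM's OpenAI-compatible server. It counts only delivered completion tokens over elapsed time. The URL assumes vLLM's default port 8000; adjust if you changed it.

```python
import json
import time
import urllib.request

MODEL = "nvidia/Qwen3.5-397B-A17B-NVFP4"
URL = "http://localhost:8000/v1/completions"   # vLLM's OpenAI-compatible endpoint

def make_request_body(prompt: str, max_tokens: int) -> dict:
    return {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}

def bench(prompt: str, max_tokens: int = 1024) -> float:
    """Return delivered tok/s over wall-clock time for one generation."""
    data = json.dumps(make_request_body(prompt, max_tokens)).encode()
    req = urllib.request.Request(
        URL, data=data, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        out = json.load(resp)
    elapsed = time.monotonic() - start
    # usage.completion_tokens counts tokens actually returned, not drafts.
    return out["usage"]["completion_tokens"] / elapsed

# Usage (with the server above running):
#   print(f"{bench('Explain MoE routing step by step.'):.1f} tok/s")
```

Run it over 1000+ tokens a few times and average; short generations overweight prefill and scheduler warmup.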


Open Issues

Has anyone else been fighting this battle on SM120? Would love to hear from other RTX PRO 6000 / RTX 5090 owners running MoE models.

submitted by /u/lawdawgattorney