AI Navigate

55 → 282 tok/s: How I got Qwen3.5-397B running at speed on 4x RTX PRO 6000 Blackwell

Reddit r/LocalLLaMA / 3/15/2026

📰 NewsTools & Practical Usage

Key Points

  • The author fixed SM120 GEMM tile issues by patching CUTLASS to support K=64 tiles, enabling faster inference on SM120 GPUs.
  • Through a series of optimizations, throughput improved from 55 tok/s (WSL2) to 282 tok/s (native Linux) for the Qwen3.5-397B model on 4x RTX PRO 6000 Blackwell.
  • The root cause was SM120's 99KB SMEM limitation and a CUTLASS TMA layout bug when K < 128; the patch introduces an EffBlk_SF computation adjustment and folding of scale factors to fit hardware constraints.
  • The results include detailed performance figures across configurations, the hardware and software environment, and a PR submission to FlashInfer with a pre-built Docker image available.

TL;DR: Built a custom CUTLASS kernel to fix SM120's broken MoE GEMM tiles. Went from 55 tok/s (WSL2) → 119 (native Linux) → 142 (driver/config optimization) → 282 tok/s (custom K=64 kernel). PR submitted to FlashInfer, pre-built Docker image available.

The Problem

If you're running NVFP4 MoE models (Qwen3.5-397B, DeepSeek, etc.) on RTX PRO 6000, RTX 5090, or DGX Spark — basically any SM120 Blackwell workstation GPU — you've probably seen this:

Failed to initialize cutlass TMA WS grouped gemm 

The autotuner skips all the SM120 GEMM tiles because they overflow your GPU's 99KB shared memory. Datacenter Blackwell (B200) has 228KB SMEM, so the tiles were designed for that. Your workstation GPU gets stuck on slow fallback kernels.

Result: You're leaving 50%+ of your throughput on the table.

The Fix

The issue is that K=128 tile shapes need more SMEM than SM120 has. K=64 tiles would fit, but CUTLASS had a bug: the TMA scale factor layout assumes K≥128 and creates a layout mismatch when K=64 (Blk_SF=4 but K=64 only has 2 scale factors along K).

I patched sm120_blockscaled_mma_builder.inl in CUTLASS to:

  1. Compute EffBlk_SF = min(K/SFVectorSize, Blk_SF) to handle K<128
  2. Fold scale factors into the basic block when they exceed MMA requirements

This lets K=64 tiles compile and run correctly on SM120's 99KB SMEM.

Results

Hardware: 4x NVIDIA RTX PRO 6000 Blackwell (96GB GDDR7 each, SM 12.0) Model: Qwen3.5-397B-A17B-NVFP4, TP=4, MTP=5 Environment: CUDA 13.2, Driver 595.45.04, vLLM 0.17.1rc1, FlashInfer 0.6.6 The Sehyo version of QWen3.5-397-a17b-NVFP4.

Users Before (tok/s) After (tok/s) Improvement
1 142 283 +99%
4 250 850 +240%
8 510 1,283 +151%

The full journey from WSL2:

Config 1-user tok/s
WSL2 baseline 55
Native Linux 119
+ MTP=5 + config tuning 134
+ Driver 595 + CUDA 13.2 + iommu=pt 142
+ Custom K=64 kernel 283

How to Use It

Pre-built Docker image (easiest)

docker pull verdictai/vllm-blackwell-k64:latest docker run -d --name vllm --gpus all --ipc host --shm-size 32g \ -p 9200:8000 \ -v /path/to/sehyo-qwen35-nvfp4:/model:ro \ -e NCCL_P2P_DISABLE=1 \ -e VLLM_WORKER_MULTIPROC_METHOD=spawn \ verdictai/vllm-blackwell-k64:latest \ python3 -m vllm.entrypoints.openai.api_server \ --model /model --served-model-name qwen3.5-397b-nvfp4 \ --host 0.0.0.0 --port 8000 --trust-remote-code \ --tensor-parallel-size 4 --gpu-memory-utilization 0.85 \ --max-model-len 262144 --enable-prefix-caching \ --reasoning-parser qwen3 --enable-auto-tool-choice \ --tool-call-parser qwen3_coder \ --speculative-config '{"method":"mtp","num_speculative_tokens":5}' 

Important notes for Threadripper users

  • NCCL_P2P_DISABLE=1 — AMD-Vi IOMMU causes page faults with GPU P2P. Add iommu=pt to kernel params if you want to try P2P instead.
  • Driver 595 — Install from NVIDIA CUDA repo: sudo apt install nvidia-open (after adding the repo). Significant improvement over 580/590 for SM120.

Other optimizations that helped

  • OMP_NUM_THREADS=6 (not 24 — avoids oversubscription with TP=4)
  • CUDA_DEVICE_MAX_CONNECTIONS=32
  • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  • MTP=5 for single-user, MTP=3 for multi-user

Upstream PR

FlashInfer PR: https://github.com/flashinfer-ai/flashinfer/pull/2786

The fix is two files:

  1. CUTLASS builder (sm120_blockscaled_mma_builder.inl) — the actual kernel fix
  2. Codegen (generate_kernels.py) — enables K=64 tile generation for SM120

Related CUTLASS issue: https://github.com/NVIDIA/cutlass/issues/3096

Who this helps

Anyone running MoE models with NVFP4 quantization on:

  • RTX PRO 6000 (Blackwell workstation)
  • RTX 5090 (consumer Blackwell)
  • DGX Spark
  • Any SM120/SM121 GPU with ~99KB SMEM

Benchmark Results

Output Length × Concurrency (all values in tok/s)

Output Length 1 User 2 Users (system) 2 Users (per-user) 4 Users (system) 4 Users (per-user)
1,000 278 506 253 857 214
2,000 282 480 240 844 211
8,000 261 468 234 792 198
16,000 231 415 208 732 183
32,000 192 351 175 620 155

Higher Concurrency (1K output tokens)

Users System tok/s Per-user tok/s
1 283 283
4 857 214
8 1,283 160
16 1,624 102

Context Length Scaling (1 user, 1K output)

Input Context tok/s
~128 tokens 283
1K 277
4K 247
16K 183
32K 141

Before vs After (K=64 kernel patch)

Metric Before After Change
1 user decode 142 283 +99%
4 user system 250 857 +243%
8 user system 510 1,283 +151%
16 user system 1,624
8 user per-user 64 160 +150%

The Full Journey

Config 1-user tok/s
WSL2 baseline 55
Native Linux 119
+ MTP=5 + config tuning 134
+ Driver 595 + CUDA 13.2 + iommu=pt 142
+ Custom K=64 kernel 283

If you've been stuck at 110-140 tok/s wondering why the B200 benchmarks show 300+, this is why. The tiles were broken on your hardware.

I want to be transparent about what these numbers represent.

The 283 tok/s figure is measured with thinking mode enabled and a short prompt. Qwen3.5 generates <think></think> tags even when there's nothing to reason about, and MTP (Multi-Token Prediction) achieves near-100% acceptance on these trivial, predictable tokens. This inflates the measured throughput significantly.

With thinking disabled and real prompts (substantive generation — essays, code, detailed explanations), single-user throughput is ~130-136 tok/s. This is the number that matters for actual usage.

Scenario 1 User tok/s Notes
Short prompt, thinking ON 283 MTP inflated by trivial think tokens
Real prompt, thinking ON 161 Think tokens still boost MTP acceptance
Real prompt, thinking OFF ~130-136 Actual usable throughput
Pre-patch baseline (community reports) ~110 Same hardware, no K=64 fix

The K=64 kernel patch still provides a real ~20-25% improvement over the pre-patch baseline on identical hardware. The fix unblocks SM120 GPUs from falling back to slow GEMM paths by giving the autotuner K=64 tiles that fit within 99KB SMEM.

Multi-user throughput with thinking OFF and real prompts:

Users System tok/s Per-user tok/s
1 136 136
2 217 109
4 342 85
8 472 59
16 605 38

I wanted the methodology to be clear to mark the difference between what you might see in "Day to day" use as an end user versus because case scenario engine throughput as I understand it to be bencmarked. Happy to answer questions. This was a wild debugging session — went from "the CUTLASS tiles just don't work on SM120" to "oh, the scale factor SMEM layout has a hardcoded assumption about K≥128" to a working fix in last several nights. lol.

submitted by /u/lawdawgattorney
[link] [comments]