TL;DR: Built a custom CUTLASS kernel to fix SM120's broken MoE GEMM tiles. Went from 55 tok/s (WSL2) → 119 (native Linux) → 142 (driver/config optimization) → 282 tok/s (custom K=64 kernel). PR submitted to FlashInfer, pre-built Docker image available.
## The Problem
If you're running NVFP4 MoE models (Qwen3.5-397B, DeepSeek, etc.) on RTX PRO 6000, RTX 5090, or DGX Spark — basically any SM120 Blackwell workstation GPU — you've probably seen this:
`Failed to initialize cutlass TMA WS grouped gemm`

The autotuner skips all the SM120 GEMM tiles because they overflow your GPU's 99KB shared memory. Datacenter Blackwell (B200) has 228KB SMEM, so the tiles were designed for that. Your workstation GPU gets stuck on slow fallback kernels.
Result: You're leaving 50%+ of your throughput on the table.
## The Fix
The issue is that K=128 tile shapes need more SMEM than SM120 has. K=64 tiles would fit, but CUTLASS had a bug: the TMA scale factor layout assumes K≥128 and creates a layout mismatch when K=64 (Blk_SF=4 but K=64 only has 2 scale factors along K).
I patched `sm120_blockscaled_mma_builder.inl` in CUTLASS to:

- Compute `EffBlk_SF = min(K/SFVectorSize, Blk_SF)` to handle K<128
- Fold scale factors into the basic block when they exceed MMA requirements
This lets K=64 tiles compile and run correctly on SM120's 99KB SMEM.
## Results
- Hardware: 4x NVIDIA RTX PRO 6000 Blackwell (96GB GDDR7 each, SM 12.0)
- Model: Qwen3.5-397B-A17B-NVFP4 (the Sehyo version), TP=4, MTP=5
- Environment: CUDA 13.2, Driver 595.45.04, vLLM 0.17.1rc1, FlashInfer 0.6.6
| Users | Before (tok/s) | After (tok/s) | Improvement |
|---|---|---|---|
| 1 | 142 | 283 | +99% |
| 4 | 250 | 857 | +243% |
| 8 | 510 | 1,283 | +151% |
The full journey from WSL2:
| Config | 1-user tok/s |
|---|---|
| WSL2 baseline | 55 |
| Native Linux | 119 |
| + MTP=5 + config tuning | 134 |
| + Driver 595 + CUDA 13.2 + iommu=pt | 142 |
| + Custom K=64 kernel | 283 |
## How to Use It
### Pre-built Docker image (easiest)
```shell
docker pull verdictai/vllm-blackwell-k64:latest

docker run -d --name vllm --gpus all --ipc host --shm-size 32g \
  -p 9200:8000 \
  -v /path/to/sehyo-qwen35-nvfp4:/model:ro \
  -e NCCL_P2P_DISABLE=1 \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  verdictai/vllm-blackwell-k64:latest \
  python3 -m vllm.entrypoints.openai.api_server \
    --model /model --served-model-name qwen3.5-397b-nvfp4 \
    --host 0.0.0.0 --port 8000 --trust-remote-code \
    --tensor-parallel-size 4 --gpu-memory-utilization 0.85 \
    --max-model-len 262144 --enable-prefix-caching \
    --reasoning-parser qwen3 --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --speculative-config '{"method":"mtp","num_speculative_tokens":5}'
```

### Important notes for Threadripper users
- `NCCL_P2P_DISABLE=1` — AMD-Vi IOMMU causes page faults with GPU P2P. Add `iommu=pt` to kernel params if you want to try P2P instead.
- Driver 595 — install from the NVIDIA CUDA repo: `sudo apt install nvidia-open` (after adding the repo). Significant improvement over 580/590 for SM120.
### Other optimizations that helped
- `OMP_NUM_THREADS=6` (not 24 — avoids oversubscription with TP=4)
- `CUDA_DEVICE_MAX_CONNECTIONS=32`
- `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
- MTP=5 for single-user, MTP=3 for multi-user
## Upstream PR
FlashInfer PR: https://github.com/flashinfer-ai/flashinfer/pull/2786
The fix is two files:

- CUTLASS builder (`sm120_blockscaled_mma_builder.inl`) — the actual kernel fix
- Codegen (`generate_kernels.py`) — enables K=64 tile generation for SM120
Related CUTLASS issue: https://github.com/NVIDIA/cutlass/issues/3096
## Who this helps
Anyone running MoE models with NVFP4 quantization on:
- RTX PRO 6000 (Blackwell workstation)
- RTX 5090 (consumer Blackwell)
- DGX Spark
- Any SM120/SM121 GPU with ~99KB SMEM
## Benchmark Results
### Output Length × Concurrency (all values in tok/s)
| Output Length | 1 User | 2 Users (system) | 2 Users (per-user) | 4 Users (system) | 4 Users (per-user) |
|---|---|---|---|---|---|
| 1,000 | 278 | 506 | 253 | 857 | 214 |
| 2,000 | 282 | 480 | 240 | 844 | 211 |
| 8,000 | 261 | 468 | 234 | 792 | 198 |
| 16,000 | 231 | 415 | 208 | 732 | 183 |
| 32,000 | 192 | 351 | 175 | 620 | 155 |
### Higher Concurrency (1K output tokens)
| Users | System tok/s | Per-user tok/s |
|---|---|---|
| 1 | 283 | 283 |
| 4 | 857 | 214 |
| 8 | 1,283 | 160 |
| 16 | 1,624 | 102 |
### Context Length Scaling (1 user, 1K output)
| Input Context | tok/s |
|---|---|
| ~128 tokens | 283 |
| 1K | 277 |
| 4K | 247 |
| 16K | 183 |
| 32K | 141 |
### Before vs After (K=64 kernel patch)
| Metric | Before | After | Change |
|---|---|---|---|
| 1 user decode | 142 | 283 | +99% |
| 4 user system | 250 | 857 | +243% |
| 8 user system | 510 | 1,283 | +151% |
| 16 user system | — | 1,624 | — |
| 8 user per-user | 64 | 160 | +150% |
If you've been stuck at 110-140 tok/s wondering why the B200 benchmarks show 300+, this is why. The tiles were broken on your hardware.
I want to be transparent about what these numbers represent.
The 283 tok/s figure is measured with thinking mode enabled and a short prompt. Qwen3.5 generates `<think></think>` tags even when there's nothing to reason about, and MTP (Multi-Token Prediction) achieves near-100% acceptance on these trivial, predictable tokens. This inflates the measured throughput significantly.
With thinking disabled and real prompts (substantive generation — essays, code, detailed explanations), single-user throughput is ~130-136 tok/s. This is the number that matters for actual usage.
| Scenario | 1 User tok/s | Notes |
|---|---|---|
| Short prompt, thinking ON | 283 | MTP inflated by trivial think tokens |
| Real prompt, thinking ON | 161 | Think tokens still boost MTP acceptance |
| Real prompt, thinking OFF | ~130-136 | Actual usable throughput |
| Pre-patch baseline (community reports) | ~110 | Same hardware, no K=64 fix |
The K=64 kernel patch still provides a real ~20-25% improvement over the pre-patch baseline on identical hardware. The fix unblocks SM120 GPUs from falling back to slow GEMM paths by giving the autotuner K=64 tiles that fit within 99KB SMEM.
Multi-user throughput with thinking OFF and real prompts:
| Users | System tok/s | Per-user tok/s |
|---|---|---|
| 1 | 136 | 136 |
| 2 | 217 | 109 |
| 4 | 342 | 85 |
| 8 | 472 | 59 |
| 16 | 605 | 38 |
I wanted the methodology to be clear, to mark the difference between what you might see in day-to-day use as an end user versus best-case engine throughput as it's usually benchmarked. Happy to answer questions. This was a wild debugging session — went from "the CUTLASS tiles just don't work on SM120" to "oh, the scale factor SMEM layout has a hardcoded assumption about K≥128" to a working fix over the last several nights. lol.