TL;DR: Built a custom CUTLASS kernel to fix SM120's broken MoE GEMM tiles. Went from 55 tok/s (WSL2) → 119 (native Linux) → 142 (driver/config optimization) → 282 tok/s (custom K=64 kernel). PR submitted to FlashInfer, pre-built Docker image available.
## The Problem
If you're running NVFP4 MoE models (Qwen3.5-397B, DeepSeek, etc.) on RTX PRO 6000, RTX 5090, or DGX Spark — basically any SM120 Blackwell workstation GPU — you've probably seen this:
`Failed to initialize cutlass TMA WS grouped gemm`

The autotuner skips all the SM120 GEMM tiles because they overflow your GPU's 99KB shared memory. Datacenter Blackwell (B200) has 228KB SMEM, so the tiles were designed for that. Your workstation GPU gets stuck on slow fallback kernels.
Result: You're leaving 50%+ of your throughput on the table.
## The Fix
The issue is that K=128 tile shapes need more SMEM than SM120 has. K=64 tiles would fit, but CUTLASS had a bug: the TMA scale factor layout assumes K≥128 and creates a layout mismatch when K=64 (Blk_SF=4 but K=64 only has 2 scale factors along K).
I patched `sm120_blockscaled_mma_builder.inl` in CUTLASS to:

- Compute `EffBlk_SF = min(K/SFVectorSize, Blk_SF)` to handle K<128
- Fold scale factors into the basic block when they exceed MMA requirements
This lets K=64 tiles compile and run correctly on SM120's 99KB SMEM.
## Results
- Hardware: 4x NVIDIA RTX PRO 6000 Blackwell (96GB GDDR7 each, SM 12.0)
- Model: Qwen3.5-397B-A17B-NVFP4 (the Sehyo version), TP=4, MTP=5
- Environment: CUDA 13.2, Driver 595.45.04, vLLM 0.17.1rc1, FlashInfer 0.6.6
| Users | Before (tok/s) | After (tok/s) | Improvement |
|---|---|---|---|
| 1 | 142 | 283 | +99% |
| 4 | 250 | 857 | +243% |
| 8 | 510 | 1,283 | +151% |
The full journey from WSL2:
| Config | 1-user tok/s |
|---|---|
| WSL2 baseline | 55 |
| Native Linux | 119 |
| + MTP=5 + config tuning | 134 |
| + Driver 595 + CUDA 13.2 + iommu=pt | 142 |
| + Custom K=64 kernel | 283 |
## How to Use It
### Pre-built Docker image (easiest)
```shell
docker pull verdictai/vllm-blackwell-k64:latest

docker run -d --name vllm --gpus all --ipc host --shm-size 32g \
  -p 9200:8000 \
  -v /path/to/sehyo-qwen35-nvfp4:/model:ro \
  -e NCCL_P2P_DISABLE=1 \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  verdictai/vllm-blackwell-k64:latest \
  python3 -m vllm.entrypoints.openai.api_server \
    --model /model --served-model-name qwen3.5-397b-nvfp4 \
    --host 0.0.0.0 --port 8000 --trust-remote-code \
    --tensor-parallel-size 4 --gpu-memory-utilization 0.85 \
    --max-model-len 262144 --enable-prefix-caching \
    --reasoning-parser qwen3 --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --speculative-config '{"method":"mtp","num_speculative_tokens":5}'
```

### Important notes for Threadripper users
- `NCCL_P2P_DISABLE=1` — AMD-Vi IOMMU causes page faults with GPU P2P. Add `iommu=pt` to kernel params if you want to try P2P instead.
- Driver 595 — install from the NVIDIA CUDA repo: `sudo apt install nvidia-open` (after adding the repo). Significant improvement over 580/590 for SM120.
### Other optimizations that helped
- `OMP_NUM_THREADS=6` (not 24 — avoids oversubscription with TP=4)
- `CUDA_DEVICE_MAX_CONNECTIONS=32`
- `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
- MTP=5 for single-user, MTP=3 for multi-user
## Upstream PR
FlashInfer PR: https://github.com/flashinfer-ai/flashinfer/pull/2786
The fix is two files:

- CUTLASS builder (`sm120_blockscaled_mma_builder.inl`) — the actual kernel fix
- Codegen (`generate_kernels.py`) — enables K=64 tile generation for SM120
Related CUTLASS issue: https://github.com/NVIDIA/cutlass/issues/3096
## Who this helps
Anyone running MoE models with NVFP4 quantization on:
- RTX PRO 6000 (Blackwell workstation)
- RTX 5090 (consumer Blackwell)
- DGX Spark
- Any SM120/SM121 GPU with ~99KB SMEM
## Benchmark Results
### Output Length × Concurrency (all values in tok/s)
| Output Length | 1 User | 2 Users (system) | 2 Users (per-user) | 4 Users (system) | 4 Users (per-user) |
|---|---|---|---|---|---|
| 1,000 | 278 | 506 | 253 | 857 | 214 |
| 2,000 | 282 | 480 | 240 | 844 | 211 |
| 8,000 | 261 | 468 | 234 | 792 | 198 |
| 16,000 | 231 | 415 | 208 | 732 | 183 |
| 32,000 | 192 | 351 | 175 | 620 | 155 |
### Higher Concurrency (1K output tokens)
| Users | System tok/s | Per-user tok/s |
|---|---|---|
| 1 | 283 | 283 |
| 4 | 857 | 214 |
| 8 | 1,283 | 160 |
| 16 | 1,624 | 102 |
### Context Length Scaling (1 user, 1K output)
| Input Context | tok/s |
|---|---|
| ~128 tokens | 283 |
| 1K | 277 |
| 4K | 247 |
| 16K | 183 |
| 32K | 141 |
### Before vs After (K=64 kernel patch)
| Metric | Before | After | Change |
|---|---|---|---|
| 1 user decode | 142 | 283 | +99% |
| 4 user system | 250 | 857 | +243% |
| 8 user system | 510 | 1,283 | +151% |
| 16 user system | — | 1,624 | — |
| 8 user per-user | 64 | 160 | +150% |
If you've been stuck at 110-140 tok/s wondering why the B200 benchmarks show 300+, this is why. The tiles were broken on your hardware.
I want to be transparent about what these numbers represent.
The 283 tok/s figure is measured with thinking mode enabled and a short prompt. Qwen3.5 generates `<think></think>` tags even when there's nothing to reason about, and MTP (Multi-Token Prediction) achieves near-100% acceptance on these trivial, predictable tokens. This inflates the measured throughput significantly.
With thinking disabled and real prompts (substantive generation — essays, code, detailed explanations), single-user throughput is ~130-136 tok/s. This is the number that matters for actual usage.
| Scenario | 1 User tok/s | Notes |
|---|---|---|
| Short prompt, thinking ON | 283 | MTP inflated by trivial think tokens |
| Real prompt, thinking ON | 161 | Think tokens still boost MTP acceptance |
| Real prompt, thinking OFF | ~130-136 | Actual usable throughput |
| Pre-patch baseline (community reports) | ~110 | Same hardware, no K=64 fix |
The K=64 kernel patch still provides a real ~20-25% improvement over the pre-patch baseline on identical hardware. The fix unblocks SM120 GPUs from falling back to slow GEMM paths by giving the autotuner K=64 tiles that fit within 99KB SMEM.
Multi-user throughput with thinking OFF and real prompts:
| Users | System tok/s | Per-user tok/s |
|---|---|---|
| 1 | 136 | 136 |
| 2 | 217 | 109 |
| 4 | 342 | 85 |
| 8 | 472 | 59 |
| 16 | 605 | 38 |
I wanted the methodology to be clear, to mark the difference between what you might see in day-to-day use as an end user versus best-case engine throughput as it's usually benchmarked. Happy to answer questions. This was a wild debugging session — went from "the CUTLASS tiles just don't work on SM120" to "oh, the scale factor SMEM layout has a hardcoded assumption about K≥128" to a working fix over the last several nights. lol.