MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — bench numbers
Reddit r/LocalLLaMA / 4/13/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — 127.7 tok/s at C=1, 2,800 tok/s peak at C=128

Ran a full sweep on Luke Alonso's M2.7 NVFP4 quant. Writing it down for anyone shopping the same setup.

**Hardware:** ASRock Rack B650D4U-2L2T, EPYC 4564P, 128GB DDR5 ECC, 2x RTX PRO 6000 Blackwell (96GB, 600W) behind a C-Payne PM50100 PLX Gen5 switch (PIX topology).

**Software:** SGLang via the voipmonitor/sglang:cu130 Docker image (b12x 0.8.3), modelopt_fp4 weights, bf16 KV cache, TP=2, Luke's default recipe.

**Decode throughput (ctx=0, mean of 3 runs, 30s per cell):**

| C | agg tok/s | per-req tok/s |
|---|-----------|---------------|
| 1 | 127.7 | 127.7 |
| 8 | 471.6 | 59.0 |
| 32 | 1078.9 | 33.7 |
| 64 | 1695.4 | 26.5 |
| 128 | 2800.2 | 21.9 |
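The post links its methodology rather than inlining it, so here is a minimal sketch of the shape such a decode sweep usually takes against SGLang's OpenAI-compatible server. The endpoint URL, served model name, prompt, and `max_tokens` are all assumptions (not the author's harness), a short prompt stands in for the ctx=0 cells, and this times a single batch rather than the author's 30s steady-state window:

```python
# Rough sketch of a decode-throughput sweep against an OpenAI-compatible
# SGLang endpoint. Hypothetical details: URL, model name, prompt, and
# max_tokens are assumptions, not the author's methodology.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:30000/v1/completions"  # assumed SGLang default port
MODEL = "MiniMax-M2.7-NVFP4"                   # assumed served model name

def one_request(max_tokens: int = 256) -> int:
    """Run one greedy completion and return its generated-token count."""
    r = requests.post(URL, json={
        "model": MODEL,
        "prompt": "Write a long story about benchmarking GPUs.",
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }, timeout=600)
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"]

def cell(concurrency: int) -> None:
    """Fire `concurrency` simultaneous requests and report throughput."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        total = sum(pool.map(lambda _: one_request(), range(concurrency)))
    agg = total / (time.perf_counter() - start)
    print(f"C={concurrency:<4} agg={agg:8.1f} tok/s  per-req={agg/concurrency:6.1f} tok/s")

for c in (1, 8, 32, 64, 128):
    cell(c)
```

Note that a fixed measurement window that re-issues requests as they finish (as the 30s/cell protocol implies) keeps the batch saturated; the single-batch version above under-reports aggregate throughput slightly because of the ramp-down tail.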
**Prefill (C=1):**

| ctx | TTFT | tok/s |
|-----|------|-------|
| 8K | 0.50s | 17,286 |
| 16K | 0.99s | 16,926 |
| 32K | 2.09s | 15,861 |
| 64K | 4.94s | 13,319 |
| 128K | 13.25s | 9,908 |
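A quick unit check on that table: the tok/s column is roughly prompt tokens divided by TTFT (assuming the K values are multiples of 1,024 tokens), e.g. 131,072 / 13.25s ≈ 9,892 tok/s against the reported 9,908. The few-percent residual at short contexts is plausibly scheduling and first-token overhead inside TTFT:

```python
# Sanity check: prefill tok/s ~= ctx_tokens / TTFT. Assumes "8K" etc.
# mean multiples of 1024 tokens; expect a few percent of slack vs. the
# reported column, since TTFT also contains non-prefill overhead.
for ctx_k, ttft, reported in [(8, 0.50, 17286), (16, 0.99, 16926),
                              (32, 2.09, 15861), (64, 4.94, 13319),
                              (128, 13.25, 9908)]:
    implied = ctx_k * 1024 / ttft
    print(f"{ctx_k:>4}K: implied {implied:8.0f} tok/s vs reported {reported}")
```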
No speculative decoding — there's no NEXTN drafter for M2.7 yet. When one ships, expect a meaningful jump at low concurrency.

Long-context cells skip at high concurrency (the KV pool is ~83K tokens with bf16 KV at TP=2). 16K is fine up to about C=8 per request before queue contention kicks in; 128K is C=1-only territory.

Full methodology and caveats:

Thanks to Luke for the kernels + quant, and to Jon for the recent calibration data update on the M2.7 NVFP4 weights.
Key Points
- MiniMax-M2.7 NVFP4 on a dual RTX PRO 6000 Blackwell setup achieves 127.7 tok/s at C=1 and scales to about 2,800 tok/s peak aggregate at C=128, with per-request throughput dropping as concurrency rises.
- Prefill throughput is measured across context lengths, ranging from ~17.3k tok/s at 8K to ~9.9k tok/s at 128K, indicating the expected slowdown as sequence length grows.
- The benchmark uses SGLang in a container, with modelopt_fp4 quantized weights, a bf16 KV cache, and TP=2 as the key parts of the inference stack.
- Speculative decoding is not included because no NEXTN drafter exists for M2.7 yet; the author expects a meaningful jump at low concurrency once one ships.
- Practical constraints are highlighted: long-context cells are skipped at high concurrency because the KV pool holds only ~83K tokens (bf16 KV, TP=2); 16K works up to roughly C=8 before queue contention erodes per-request throughput, and 128K is C=1-only. A back-of-envelope sizing sketch follows below.
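The ~83K figure is whatever the server computed at startup, but the bf16 KV budget behind it is easy to approximate. A minimal sketch, assuming invented model shapes: M2.7's layer count, KV-head count, and head_dim are not given in the post, and the free-VRAM figure is equally made up, so treat this as the formula rather than the model's numbers:

```python
# Generic bf16 KV-cache token budget. The shapes below are PLACEHOLDERS --
# MiniMax-M2.7's real layer count, KV heads, and head_dim aren't in the
# post -- so only the formula, not the output, should be trusted.
BYTES_BF16 = 2

def kv_pool_tokens(free_vram_bytes_per_gpu: int, layers: int,
                   kv_heads: int, head_dim: int, tp: int = 2) -> int:
    # K and V are each stored per layer; with tensor parallelism the KV
    # heads are sharded, so each GPU holds kv_heads / tp of them.
    bytes_per_token_per_gpu = 2 * layers * (kv_heads // tp) * head_dim * BYTES_BF16
    return free_vram_bytes_per_gpu // bytes_per_token_per_gpu

# Invented example: ~10 GiB left for KV per GPU after weights and
# activations, 62 layers, 8 KV heads, head_dim 128 -> ~84.6K tokens,
# the same ballpark as the ~83K pool the author reports.
print(kv_pool_tokens(10 * 1024**3, layers=62, kv_heads=8, head_dim=128))
```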




