MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — bench numbers
Reddit r/LocalLLaMA / 4/13/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — 127.7 tok/s at C=1, 2,800 tok/s peak at C=128

Ran a full sweep on Luke Alonso's M2.7 NVFP4 quant. Writing it down for anyone shopping the same setup.

**Hardware:** ASRock Rack B650D4U-2L2T, EPYC 4564P, 128GB DDR5 ECC, 2x RTX PRO 6000 Blackwell (96GB, 600W) behind a C-Payne PM50100 PLX Gen5 switch (PIX topology).

**Software:** SGLang via the voipmonitor/sglang:cu130 Docker image (b12x 0.8.3), modelopt_fp4 weights, bf16 KV cache, TP=2, Luke's default recipe.

**Decode throughput (ctx=0, mean of 3 runs, 30s per cell):**

| C | agg tok/s | per-req tok/s |
|---|-----------|---------------|
| 1 | 127.7 | 127.7 |
| 8 | 471.6 | 59.0 |
| 32 | 1078.9 | 33.7 |
| 64 | 1695.4 | 26.5 |
| 128 | 2800.2 | 21.9 |
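The post links its methodology rather than inlining it, so here is a minimal sketch of the shape such a decode sweep usually takes against SGLang's OpenAI-compatible server. The endpoint URL, served model name, prompt, and `max_tokens` are all assumptions (not the author's harness), a short prompt stands in for the ctx=0 cells, and this times a single batch rather than the author's 30s steady-state window:

```python
# Rough sketch of a decode-throughput sweep against an OpenAI-compatible
# SGLang endpoint. Hypothetical details: URL, model name, prompt, and
# max_tokens are assumptions, not the author's methodology.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:30000/v1/completions"  # assumed SGLang default port
MODEL = "MiniMax-M2.7-NVFP4"                   # assumed served model name

def one_request(max_tokens: int = 256) -> int:
    """Run one greedy completion and return its generated-token count."""
    r = requests.post(URL, json={
        "model": MODEL,
        "prompt": "Write a long story about benchmarking GPUs.",
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }, timeout=600)
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"]

def cell(concurrency: int) -> None:
    """Fire `concurrency` simultaneous requests and report throughput."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        total = sum(pool.map(lambda _: one_request(), range(concurrency)))
    agg = total / (time.perf_counter() - start)
    print(f"C={concurrency:<4} agg={agg:8.1f} tok/s  per-req={agg/concurrency:6.1f} tok/s")

for c in (1, 8, 32, 64, 128):
    cell(c)
```

Note that a fixed measurement window that re-issues requests as they finish (as the 30s/cell protocol implies) keeps the batch saturated; the single-batch version above under-reports aggregate throughput slightly because of the ramp-down tail.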
**Prefill (C=1):**

| ctx | TTFT | tok/s |
|-----|------|-------|
| 8K | 0.50s | 17,286 |
| 16K | 0.99s | 16,926 |
| 32K | 2.09s | 15,861 |
| 64K | 4.94s | 13,319 |
| 128K | 13.25s | 9,908 |
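A quick unit check on that table: the tok/s column is roughly prompt tokens divided by TTFT (assuming the K values are multiples of 1,024 tokens), e.g. 131,072 / 13.25s ≈ 9,892 tok/s against the reported 9,908. The few-percent residual at short contexts is plausibly scheduling and first-token overhead inside TTFT:

```python
# Sanity check: prefill tok/s ~= ctx_tokens / TTFT. Assumes "8K" etc.
# mean multiples of 1024 tokens; expect a few percent of slack vs. the
# reported column, since TTFT also contains non-prefill overhead.
for ctx_k, ttft, reported in [(8, 0.50, 17286), (16, 0.99, 16926),
                              (32, 2.09, 15861), (64, 4.94, 13319),
                              (128, 13.25, 9908)]:
    implied = ctx_k * 1024 / ttft
    print(f"{ctx_k:>4}K: implied {implied:8.0f} tok/s vs reported {reported}")
```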
No speculative decoding — there's no NEXTN drafter for M2.7 yet. When one ships, expect a meaningful jump at low concurrency.

Long-context cells skip at high concurrency (the KV pool is ~83K tokens with bf16 KV at TP=2). 16K is fine up to about C=8 per request before queue contention kicks in; 128K is C=1-only territory.

Full methodology and caveats:

Thanks to Luke for the kernels + quant, and to Jon for the recent calibration data update on the M2.7 NVFP4 weights.
Key Points
- MiniMax-M2.7 NVFP4 on a dual RTX PRO 6000 Blackwell setup achieves 127.7 tok/s at C=1 and scales to about 2,800 tok/s peak aggregate at C=128, with per-request throughput dropping as concurrency rises.
- Prefill throughput is measured across context lengths, ranging from ~17.3k tok/s at 8K to ~9.9k tok/s at 128K, indicating the expected slowdown as sequence length grows.
- The benchmark uses SGLang in a container, with modelopt_fp4 quantized weights, a bf16 KV cache, and TP=2 as the key parts of the inference stack.
- Speculative decoding is not included because no NEXTN drafter exists for M2.7 yet; the author expects a meaningful jump at low concurrency once one ships.
- Practical constraints are highlighted: long-context cells are skipped at high concurrency because the KV pool holds only ~83K tokens (bf16 KV, TP=2); 16K works up to roughly C=8 before queue contention erodes per-request throughput, and 128K is C=1-only. A back-of-envelope sizing sketch follows below.
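The ~83K figure is whatever the server computed at startup, but the bf16 KV budget behind it is easy to approximate. A minimal sketch, assuming invented model shapes: M2.7's layer count, KV-head count, and head_dim are not given in the post, and the free-VRAM figure is equally made up, so treat this as the formula rather than the model's numbers:

```python
# Generic bf16 KV-cache token budget. The shapes below are PLACEHOLDERS --
# MiniMax-M2.7's real layer count, KV heads, and head_dim aren't in the
# post -- so only the formula, not the output, should be trusted.
BYTES_BF16 = 2

def kv_pool_tokens(free_vram_bytes_per_gpu: int, layers: int,
                   kv_heads: int, head_dim: int, tp: int = 2) -> int:
    # K and V are each stored per layer; with tensor parallelism the KV
    # heads are sharded, so each GPU holds kv_heads / tp of them.
    bytes_per_token_per_gpu = 2 * layers * (kv_heads // tp) * head_dim * BYTES_BF16
    return free_vram_bytes_per_gpu // bytes_per_token_per_gpu

# Invented example: ~10 GiB left for KV per GPU after weights and
# activations, 62 layers, 8 KV heads, head_dim 128 -> ~84.6K tokens,
# the same ballpark as the ~83K pool the author reports.
print(kv_pool_tokens(10 * 1024**3, layers=62, kv_heads=8, head_dim=128))
```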




