Nemotron-3-Super-120B-A12B NVFP4 inference benchmark on one RTX Pro 6000 Blackwell

Reddit r/LocalLLaMA / 3/13/2026



Key Points

Nemotron-3-Super-120B-A12B NVFP4 was benchmarked on a single RTX Pro 6000 using vLLM with an fp8 KV cache. The sweep covered contexts from 1K to 512K with 1–5 concurrent requests and 1024 output tokens per request, and results are reported as steady-state averages under sustained load.
Per-user generation speed (tokens per second) declines as both context size and concurrent user count increase, with examples such as 1K context at 1 user ~69.9 tok/s and 5 users ~41.4 tok/s, and 128K context at 5 users ~18.6 tok/s.
Time to first token grows with larger contexts and more users; for instance, 1K context yields about 0.1–0.2s for a single user, while 128K context reaches around 12.1s for 1 user and higher values for more users.
The post stresses that this is a team-oriented benchmark, not one tuned for peak single-user performance. Its methodology notes add that no prompt caching was used and that the fp8 KV cache setup follows Nvidia's approach.

I ran Nemotron-3-Super-120B-A12B NVFP4 through a full benchmark sweep on a single RTX Pro 6000 using vLLM. The KV cache was fp8, following Nvidia's setup (it is unclear whether their published metrics were measured with an fp8 KV cache). Context ranged from 1K to 512K, with 1 to 5 concurrent requests and 1024 output tokens per request. No prompt caching.

Numbers are steady-state averages across sustained load. This is a team-oriented benchmark, not tuned for peak single-user performance. Methodology details at the bottom.

Per-User Generation Speed (tok/s)

Context	1 User	2 Users	3 Users	5 Users
1K	69.9	58.3	52.7	41.4
8K	70.8	65.7	47.8	38.8
32K	75.1	59.8	45.5	37.2
64K	67.7	50.6	40.8	27.9
96K	67.3	52.5	34.1	22.9
128K	66.8	42.6	35.0	18.6
256K	65.2	29.6	18.4	N/A
512K	62.3	N/A	N/A	N/A
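The table reports speed per user, so total server throughput has to be derived. A minimal sketch, assuming every concurrent request decodes at the reported per-user rate at the same time (an estimate, not a figure from the report; the function name is made up for illustration):

```python
# Per-user decode speed (tok/s), copied from the table above.
# None marks the N/A cells.
USERS = [1, 2, 3, 5]
PER_USER_TOKS = {
    "1K":   [69.9, 58.3, 52.7, 41.4],
    "8K":   [70.8, 65.7, 47.8, 38.8],
    "32K":  [75.1, 59.8, 45.5, 37.2],
    "64K":  [67.7, 50.6, 40.8, 27.9],
    "96K":  [67.3, 52.5, 34.1, 22.9],
    "128K": [66.8, 42.6, 35.0, 18.6],
    "256K": [65.2, 29.6, 18.4, None],
    "512K": [62.3, None, None, None],
}


def aggregate_throughput(context):
    """Return {users: per-user speed * users}, skipping N/A cells.

    Assumption: all users decode concurrently at the reported rate.
    """
    return {
        n: round(speed * n, 1)
        for n, speed in zip(USERS, PER_USER_TOKS[context])
        if speed is not None
    }


if __name__ == "__main__":
    for context in PER_USER_TOKS:
        print(context, aggregate_throughput(context))
```

Under that assumption, 1K context goes from 69.9 tok/s total with one user to about 207 tok/s total with five, even though each user sees a slower stream.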

Time to First Token

Context	1 User	2 Users	3 Users	5 Users
1K	0.1s	0.2s	0.2s	0.2s
8K	0.6s	0.9s	1.1s	1.2s
32K	2.3s	3.6s	4.7s	6.8s
64K	5.0s	7.6s	10.3s	14.5s
96K	8.3s	12.7s	16.8s	23.4s
128K	12.1s	18.4s	24.4s	32.5s
256K	32.6s	47.2s	64.7s	N/A
512K	98.4s	N/A	N/A	N/A
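Two derived figures help when reading the TTFT table: an effective prefill rate (context tokens divided by TTFT) and an end-to-end time for one 1024-token reply. The sketch below uses the single-user column only. It assumes "1K" means 1024 tokens, which the post does not state, and because TTFT also includes scheduling overhead, the prefill rate is a rough lower bound rather than a measured value.

```python
TOKENS_PER_K = 1024   # assumption: "1K" context = 1024 tokens
OUTPUT_TOKENS = 1024  # output tokens per request, as stated in the post

# Single-user results from the two tables: context (in K) -> (TTFT s, tok/s)
SINGLE_USER = {
    1: (0.1, 69.9),
    8: (0.6, 70.8),
    32: (2.3, 75.1),
    64: (5.0, 67.7),
    96: (8.3, 67.3),
    128: (12.1, 66.8),
    256: (32.6, 65.2),
    512: (98.4, 62.3),
}


def prefill_rate(context_k, ttft_s):
    """Effective prompt tokens processed per second before the first token."""
    return context_k * TOKENS_PER_K / ttft_s


def end_to_end_seconds(ttft_s, decode_tok_s, output_tokens=OUTPUT_TOKENS):
    """TTFT plus the time to stream the full reply at the decode speed."""
    return ttft_s + output_tokens / decode_tok_s


if __name__ == "__main__":
    for context_k, (ttft, speed) in SINGLE_USER.items():
        print(
            f"{context_k}K: ~{prefill_rate(context_k, ttft):,.0f} prefill tok/s, "
            f"~{end_to_end_seconds(ttft, speed):.1f}s end to end"
        )
```

The 1K row is the least reliable here, since a TTFT rounded to 0.1s leaves a wide margin of error.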

Capacity by Use Case

Each row lists the thresholds for one workload and the maximum number of concurrent requests that stays within them. There is no caching, so this is the worst-case scenario. The thresholds are just my own; the capacity charts are in the full report.

Use Case	TTFT Threshold	Speed Threshold	Max Concurrency
Code Completion (1K)	2s e2e	N/A	1
Short-form Chatbot (8K)	10s	10 tok/s	70
General Chatbot (32K)	8s	15 tok/s	7
Long Document Processing (64K)	12s	15 tok/s	3
Automated Coding Assistant (96K)	12s	20 tok/s	1
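The capacity lookup can be sketched against the numbers in this post. This is only an illustration of the idea, not the report's actual method: it checks the four concurrency levels shown in the tables above (1, 2, 3, 5) and returns the highest one that meets both thresholds. The function name is hypothetical.

```python
USERS = [1, 2, 3, 5]

# TTFT (s) and per-user speed (tok/s) copied from the tables in this post.
MEASURED = {
    "32K": {"ttft": [2.3, 3.6, 4.7, 6.8], "speed": [75.1, 59.8, 45.5, 37.2]},
    "64K": {"ttft": [5.0, 7.6, 10.3, 14.5], "speed": [67.7, 50.6, 40.8, 27.9]},
    "96K": {"ttft": [8.3, 12.7, 16.8, 23.4], "speed": [67.3, 52.5, 34.1, 22.9]},
}


def max_measured_concurrency(context, max_ttft_s, min_tok_s):
    """Highest listed user count that meets both thresholds (0 if none).

    Stops at the first failing level, assuming results only get worse
    as concurrency rises.
    """
    best = 0
    rows = zip(USERS, MEASURED[context]["ttft"], MEASURED[context]["speed"])
    for users, ttft, speed in rows:
        if ttft <= max_ttft_s and speed >= min_tok_s:
            best = users
        else:
            break
    return best


if __name__ == "__main__":
    print("64K:", max_measured_concurrency("64K", 12, 15))
    print("96K:", max_measured_concurrency("96K", 12, 20))
    print("32K:", max_measured_concurrency("32K", 8, 15))
```

With the post's thresholds this gives 3 for 64K and 1 for 96K, matching the table. For 32K it tops out at 5, the highest level listed here, while the table says 7, so that row presumably draws on data from the full report that is not reproduced in this post.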

After loading model weights, only about 14GB of VRAM was left for KV cache. I tried setting the context length to 1M; it loaded without errors and the logs showed "Maximum concurrency for 1,048,576 tokens per request: 3.27x". I couldn't actually complete a request at 1M though, most likely because of a compute limitation. I did get a 768K request to complete, but the TTFT was over 3 minutes. Two cards will likely handle 1M, and I plan to test that soon.
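A rough reading of that log line, as a sketch: if the concurrency factor is taken as cache capacity divided by the per-request length, the cache holds about 3.4M tokens, which works out to roughly 4 KB of cache per token on 14GB. Both the interpretation of the log line and "14GB = 14e9 bytes" are assumptions, and vLLM may account for cache differently for this architecture, so the per-token figure is indicative only.

```python
MAX_MODEL_LEN = 1_048_576     # tokens per request, from the quoted log line
CONCURRENCY_FACTOR = 3.27     # "3.27x", from the quoted log line
FREE_CACHE_BYTES = 14e9       # assumption: "about 14GB" read as 14e9 bytes


def implied_cache_tokens(max_model_len, concurrency_factor):
    """Token capacity implied by the log line, if factor = capacity / length."""
    return max_model_len * concurrency_factor


def bytes_per_cached_token(free_bytes, cache_tokens):
    """Average cache footprint per token under the assumptions above."""
    return free_bytes / cache_tokens


if __name__ == "__main__":
    tokens = implied_cache_tokens(MAX_MODEL_LEN, CONCURRENCY_FACTOR)
    print(f"implied cache capacity: ~{tokens:,.0f} tokens")
    print(f"~{bytes_per_cached_token(FREE_CACHE_BYTES, tokens):,.0f} bytes per token")
```

Read this way, memory alone does not explain the failed 1M request, which is consistent with the guess that compute was the limit.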

Single-user decode speed was slower than I expected. The speed holds up across context lengths though: 62.3 tok/s at 512K is only an 11% drop from 69.9 tok/s at 1K.
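The 11% figure is a plain relative drop against the 1K baseline. A short sketch of that arithmetic, using the single-user column of the speed table (the helper name is made up for illustration):

```python
# Single-user decode speed (tok/s) by context, from the speed table.
SINGLE_USER_TOKS = {
    "1K": 69.9, "8K": 70.8, "32K": 75.1, "64K": 67.7,
    "96K": 67.3, "128K": 66.8, "256K": 65.2, "512K": 62.3,
}


def percent_drop(baseline, value):
    """Relative slowdown of value vs. baseline, in percent (negative = faster)."""
    return (baseline - value) / baseline * 100


if __name__ == "__main__":
    baseline = SINGLE_USER_TOKS["1K"]
    for context, speed in SINGLE_USER_TOKS.items():
        print(f"{context}: {percent_drop(baseline, speed):+.1f}% vs 1K")
```

The 8K and 32K rows come out negative because they were faster than 1K in this run. Measured against the 32K peak of 75.1 tok/s instead, the drop at 512K is about 17%.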

I had trouble getting SGLang to run well. It will likely have faster decode speed than vLLM once I get it working.

Methodology Notes

The benchmark targets concurrent, multi-user workloads. A setup tuned for one person would have better single-user speeds than this one.

All TTFT numbers are without prompt caching, so these are cold prefill times. Caching would cut TTFT substantially where prefill is the bottleneck. Numbers are steady-state, not burst.

How this was tested: https://www.millstoneai.com/inference-benchmark-methodology

Full report with interactive charts: https://www.millstoneai.com/inference-benchmark/nemotron-3-super-120b-a12b-nvfp4-1x-rtx-pro-6000-blackwell

submitted by /u/jnmi235