
Inference numbers for Mistral-Small-4-119B-2603 NVFP4 on an RTX Pro 6000

Reddit r/LocalLLaMA / 3/18/2026



Benchmarked Mistral-Small-4-119B-2603 NVFP4 on an RTX Pro 6000. Used SGLang, context from 1K to 256K, 1 to 5 concurrent requests, and 1024 output tokens per request. No prompt caching, no speculative decoding (I couldn't get it working for the NVFP4 model), and a full-precision KV cache. Methodology below.
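For reference, a sketch of an SGLang launch matching this setup. The model path is a placeholder, and apart from `--mem-fraction-static 0.87` (stated below) the flag values are my assumptions; `--disable-radix-cache` is how SGLang turns off prefix/prompt caching.

```shell
# Sketch only: model path is a placeholder, flag values beyond
# --mem-fraction-static 0.87 are assumptions matching the post's setup.
python -m sglang.launch_server \
  --model-path mistralai/<nvfp4-checkpoint> \
  --context-length 262144 \
  --mem-fraction-static 0.87 \
  --disable-radix-cache \
  --port 30000
```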

Per-User Generation Speed (tok/s)

| Context | 1 User | 2 Users | 3 Users | 5 Users |
|---------|--------|---------|---------|---------|
| 1K      | 131.3  | 91.2    | 78.2    | 67.3    |
| 8K      | 121.4  | 84.5    | 74.1    | 61.7    |
| 32K     | 110.0  | 75.9    | 63.6    | 53.3    |
| 64K     | 96.9   | 68.7    | 55.5    | 45.0    |
| 96K     | 86.7   | 60.4    | 49.7    | 38.1    |
| 128K    | 82.2   | 56.2    | 44.7    | 33.8    |
| 256K    | 64.2   | 42.8    | N/A     | N/A     |
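Per-user decode speed drops with concurrency, but aggregate throughput still rises. A quick check using the 1K-context row from the table above:

```python
# Aggregate decode throughput = per-user tok/s x number of users,
# using values from the per-user generation speed table above.
def aggregate(per_user_tok_s: float, users: int) -> float:
    return per_user_tok_s * users

# 1K context: 1 user at 131.3 tok/s vs 5 users at 67.3 tok/s each.
single = aggregate(131.3, 1)   # 131.3 tok/s total
five = aggregate(67.3, 5)      # 336.5 tok/s total
print(f"1K context, 5 users: {five:.1f} tok/s aggregate "
      f"({five / single:.2f}x the single-user throughput)")
```

So five concurrent users cost each user about half their speed but roughly 2.5x the card's total output.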

Time to First Token

| Context | 1 User | 2 Users | 3 Users | 5 Users |
|---------|--------|---------|---------|---------|
| 1K      | 0.5s   | 0.6s    | 0.7s    | 0.8s    |
| 8K      | 0.9s   | 1.5s    | 2.0s    | 2.1s    |
| 32K     | 2.5s   | 4.5s    | 6.6s    | 10.6s   |
| 64K     | 6.3s   | 11.9s   | 17.5s   | 28.7s   |
| 96K     | 11.8s  | 23.0s   | 34.0s   | 56.0s   |
| 128K    | 19.2s  | 37.6s   | 55.9s   | 92.3s   |
| 256K    | 66.8s  | 131.9s  | N/A     | N/A     |
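TTFT grows faster than linearly with context, consistent with attention's superlinear prefill cost. Dividing context length by single-user TTFT from the table above gives a rough effective prefill rate (rough because TTFT also includes scheduling and the first decode step):

```python
# Rough single-user prefill throughput: context tokens / TTFT,
# with context sizes taken as N * 1024 tokens and TTFTs from the table.
ttft_1user = {1: 0.5, 8: 0.9, 32: 2.5, 64: 6.3, 96: 11.8, 128: 19.2, 256: 66.8}

for k_tokens, ttft in ttft_1user.items():
    rate = k_tokens * 1024 / ttft
    print(f"{k_tokens:>3}K context: {rate:>8.0f} tok/s prefill")
```

The effective rate falls from ~13K tok/s at 32K context to under 4K tok/s at 256K, which is why long-context TTFT blows up so quickly.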

Capacity by Use Case

I found the highest concurrency that stays within the thresholds below. Everything is without caching, so the full prompt is processed every time.

| Use Case                              | TTFT Threshold | Speed Threshold | Max Concurrency |
|---------------------------------------|----------------|-----------------|-----------------|
| Code Completion (1K ctx, 128 output)  | 2s e2e         | N/A             | 5               |
| Short-form Chatbot (8K)               | 10s            | 10 tok/s        | 19              |
| General Chatbot (32K)                 | 8s             | 15 tok/s        | 3               |
| Long Document Processing (64K)        | 12s            | 15 tok/s        | 2               |
| Automated Coding Assistant (96K)      | 12s            | 20 tok/s        | 1               |
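As a sanity check, the 32K/64K/96K rows of this table can be reproduced from the two measurement tables above, restricted to the user counts actually tested:

```python
# Per-user decode speed (tok/s) and TTFT (s) by context (K tokens) and
# user count, copied from the two measurement tables above.
speed = {32: {1: 110.0, 2: 75.9, 3: 63.6, 5: 53.3},
         64: {1: 96.9, 2: 68.7, 3: 55.5, 5: 45.0},
         96: {1: 86.7, 2: 60.4, 3: 49.7, 5: 38.1}}
ttft = {32: {1: 2.5, 2: 4.5, 3: 6.6, 5: 10.6},
        64: {1: 6.3, 2: 11.9, 3: 17.5, 5: 28.7},
        96: {1: 11.8, 2: 23.0, 3: 34.0, 5: 56.0}}

def max_concurrency(ctx_k: int, ttft_max: float, speed_min: float) -> int:
    """Largest tested user count meeting both thresholds."""
    ok = [u for u in speed[ctx_k]
          if ttft[ctx_k][u] <= ttft_max and speed[ctx_k][u] >= speed_min]
    return max(ok, default=0)

print(max_concurrency(32, 8, 15))   # general chatbot
print(max_concurrency(64, 12, 15))  # long document processing
print(max_concurrency(96, 12, 20))  # automated coding assistant
```

TTFT is the binding constraint in every one of these rows; decode speed never comes close to the threshold first.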

Single-user performance is pretty good on both decode and TTFT. At higher concurrency, TTFT is the binding metric. I set --mem-fraction-static 0.87 to leave room for CUDA graphs, which gave 15.06GB for the KV cache, 703K total tokens according to SGLang. That's a decent amount to use for caching, which would help TTFT significantly with several concurrent users.

I also tested vLLM using Mistral's custom container. It did have better TTFT, but decode was much slower, especially at longer context lengths; I'm assuming there are some issues between their vLLM container and this card. I also couldn't get speculative decoding to work. I think it's only supported for the FP8 model right now.
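Back-of-envelope on those KV cache numbers: 15.06GB over 703K tokens works out to roughly 21KB of KV cache per token.

```python
# KV cache footprint per token, from SGLang's reported numbers above.
kv_bytes = 15.06e9      # 15.06 GB reserved for KV cache
kv_tokens = 703_000     # token capacity reported by SGLang

bytes_per_token = kv_bytes / kv_tokens
print(f"~{bytes_per_token / 1024:.1f} KiB of KV cache per token")
# With a full-precision (16-bit) KV cache, dropping to an 8-bit KV cache
# would roughly double the token capacity in the same 15.06GB.
```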

Methodology Notes

TTFT numbers are all without caching, so they're worst-case; caching would decrease TTFT quite a bit. Numbers are steady-state averages under sustained load (locust-based), not burst.
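For reference, both metrics here can be computed from per-token arrival timestamps in a streaming client. A minimal sketch (the timestamps in the example are hypothetical, not from this benchmark):

```python
# Computing TTFT and per-user decode speed from streaming timestamps.
# t_start is when the request was sent; token_times are the arrival
# times of each generated token.
def ttft_and_decode_speed(t_start: float, token_times: list[float]):
    ttft = token_times[0] - t_start
    # Decode speed over the generation phase only: tokens after the
    # first, divided by the time between first and last token.
    decode_tok_s = (len(token_times) - 1) / (token_times[-1] - token_times[0])
    return ttft, decode_tok_s

# Hypothetical example: first token 0.5s after send, then 10 tok/s.
ttft, speed = ttft_and_decode_speed(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
print(f"TTFT={ttft:.1f}s, decode={speed:.1f} tok/s")
```

Separating the two phases like this is why the tables report TTFT and decode speed independently: they're bound by different costs (prefill vs. batched decode).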

Methodology: https://www.millstoneai.com/inference-benchmark-methodology

Full report: https://www.millstoneai.com/inference-benchmark/mistral-small-4-119b-2603-nvfp4-1x-rtx-pro-6000-blackwell

submitted by /u/jnmi235