I was trying to get a good set of NVFP4 models running to take advantage of the RTX Pro 6000. I got past a few hurdles, set up configs + wheels, and ran benchmarks while I was at it. Hopefully this helps some folks out.
This should work on all NVIDIA Blackwell cards (5090, 5080, 5070 Ti, etc.) as long as the models fit (you could maybe stack 2x 5070 Tis).
Anyhow, here's the repo with everything:
https://github.com/elsung/blackwell-llm-toolkit
Gotchas & solutions
TRT-LLM launch flags
- A few obscure settings had to be enabled to get TensorRT-LLM to run the newer Mamba-hybrid models. The YAML is in the repo at `configs/trtllm/nemotron-omni-v3-sm120.yaml`.
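For reference, here's a minimal sketch of how a YAML like that can be fed into TensorRT-LLM's Python `LLM` API. This is not the repo's launcher; the assumption is that the YAML keys map onto `LLM(...)` keyword arguments, so use the repo's `configs/trtllm/nemotron-omni-v3-sm120.yaml` for the real settings.

```python
# Minimal sketch (not the repo's exact launcher): load the repo YAML and pass its
# keys into TensorRT-LLM's high-level LLM API. Assumes the YAML keys line up with
# LLM(...) kwargs; see configs/trtllm/nemotron-omni-v3-sm120.yaml for the real ones.
import yaml
from tensorrt_llm import LLM, SamplingParams

with open("configs/trtllm/nemotron-omni-v3-sm120.yaml") as f:
    extra_opts = yaml.safe_load(f)  # the "obscure settings" mentioned above

llm = LLM(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4",
    **extra_opts,  # assumption: YAML maps 1:1 onto LLM keyword arguments
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```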
LMCache
- Offloads KV cache/context to SSD to free up VRAM for the model. The PyPI wheel was crashing on Blackwell (missing sm_120 cubins), so I rebuilt it from source; it works great on my Optane drive. Both the prebuilt wheel and the build script are in the repo.
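The sm_120 issue is easy to spot ahead of time. This generic PyTorch check (not part of the repo's build script) reads the GPU's compute capability and the architectures your Torch build actually ships kernels for; the same idea applies to extension wheels like LMCache that bundle their own CUDA kernels.

```python
# Quick sanity check for Blackwell (sm_120) support before trusting a prebuilt wheel.
# Generic PyTorch calls; not taken from the repo's build script.
import torch

major, minor = torch.cuda.get_device_capability(0)
arch = f"sm_{major}{minor}"            # RTX Pro 6000 / RTX 50-series report sm_120
compiled = torch.cuda.get_arch_list()  # archs this torch build ships cubins/PTX for

print(f"GPU arch: {arch}")
print(f"Compiled archs: {compiled}")
if arch not in compiled and f"compute_{major}{minor}" not in compiled:
    print("No sm_120 kernels in this build; expect crashes, rebuild from source.")
```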
Research docs
- AI-generated deep dives on what's actually different about the latest model families (Nemotron Omni V3, Qwen 3.5/3.6, Gemma 4). Helpful reference. The Qwen 3.5/3.6 one in particular saved me from a nasty trap: they look like renamed Qwen3-VL, but they're a completely different architecture under the hood.
Benchmark highlights
Single RTX Pro 6000 (96 GB), no tensor parallelism. Speed numbers are sustained decode tok/s (median of 3 runs, 500-token completions).
Nemotron-3-Nano-Omni V3 (multimodal — image/video/audio + text)
- NVFP4 quant, tested at 8k context → **270 tok/s**
- Fastest + handles all modalities. Needs TRT-LLM v1.3.0rc13.
- https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4
Nemotron-3-Nano (text only)
- NVFP4 quant, tested at 8k context → **249 tok/s**
- Best for tool-calling agents (10/10 on tools).
- https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
DeepSeek-V4-Flash
- IQ2_XXS-XL GGUF, tested at 65k context → **31 tok/s**
- Best for complex reasoning (9/10 intel + 10/10 tools + 13/13 calibration).
- https://huggingface.co/teamblobfish/DeepSeek-V4-Flash-GGUF (IQ2_XXS-XL)
MiniMax-M2.7-REAP-172B
- Q3_K_S GGUF, tested at 196k context → **117 tok/s**
- Best for long conversations.
- https://huggingface.co/exdysa/MiniMax-M2.7-REAP-172B-A10B-GGUF (Q3_K_S)
MiniMax-M2.7 W4A16 (with LMCache → Optane SSD)
- W4A16 AutoRound, tested at 154k context → **20-22 tok/s**
- Long-context runs with W4A16-quality answers; KV cache offloaded to SSD (rough sketch below).
- https://huggingface.co/MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16
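For the SSD offload piece, here's a hedged sketch of what the LMCache + vLLM wiring can look like. The config keys (`chunk_size`, `local_disk`, `max_local_disk_size`), the `LMCacheConnectorV1` kv-transfer setting, and the Optane mount path are my assumptions based on LMCache's documented vLLM integration, not copied from the repo; check the repo's configs and build script for the exact setup.

```python
# Sketch only: write an LMCache config pointing at an SSD/Optane mount and launch
# vLLM with the LMCache connector. Config keys and connector name are assumptions
# based on LMCache's documented vLLM integration; the mount path is hypothetical.
import json, os, subprocess

lmcache_cfg = """\
chunk_size: 256
local_disk: "file:///mnt/optane/lmcache/"   # hypothetical mount point
max_local_disk_size: 200                     # GB of KV cache allowed on disk
"""
with open("lmcache_disk.yaml", "w") as f:
    f.write(lmcache_cfg)

env = dict(os.environ, LMCACHE_CONFIG_FILE="lmcache_disk.yaml")
subprocess.run([
    "vllm", "serve", "MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16",
    "--max-model-len", "154000",  # 154k context, as in the run above
    "--kv-transfer-config",
    json.dumps({"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}),
], env=env, check=True)
```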
MiniMax-M2.7 W4A16 (short ctx, no LMCache)
- Same model as above, tested at 64k context → **22-25 tok/s**
- Highest-quality short answers (10/10 intel).
Full table with TTFT, prefill speeds, concurrency numbers, and all quality eval scores → bench/results.md in the repo: https://github.com/elsung/blackwell-llm-toolkit/blob/main/bench/results.md
Bench tools used to validate
- `rapid_bench.py` — 41-prompt quality eval (10 intelligence + 10 tool-use + 13 calibration + 3 orchestration + 5 creative writing)
- `bench_harness.py` — sustained decode + TTFT + prefill + concurrency, plus a `--prompt-tokens N` mode for the 154k long-ctx mjpansa runs
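If you want to sanity-check the decode numbers against any OpenAI-compatible endpoint yourself, a bare-bones version of the measurement looks roughly like this. It is my own sketch, not `bench_harness.py`; the endpoint URL and model name are placeholders, and it counts streamed chunks as tokens, which holds for most local servers.

```python
# Rough TTFT + sustained-decode measurement against an OpenAI-compatible server.
# Not the repo's bench_harness.py, just a minimal sketch; URL/model are placeholders.
import time
from statistics import median
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def one_run(prompt: str, max_tokens: int = 500):
    start = time.perf_counter()
    first_token = None
    n_tokens = 0
    stream = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            n_tokens += 1  # one streamed chunk ~= one token on most servers
            if first_token is None:
                first_token = time.perf_counter()
    end = time.perf_counter()
    return first_token - start, n_tokens / (end - first_token)

runs = [one_run("Explain KV cache offloading in one paragraph.") for _ in range(3)]
print("median TTFT (s):      ", round(median(r[0] for r in runs), 3))
print("median decode (tok/s):", round(median(r[1] for r in runs), 1))
```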
Apache 2.0, PRs welcome, especially benchmark contributions from folks with other Blackwell GPUs (RTX 5090/5080/5070 Ti) so the comparison fills out across different hardware.