Blackwell LLM Toolkit - NVFP4 Config +Wheels + Benchmarks for Blackwell GPUs via TensorRT-LLM - 270 tk/s Nemotron 3 Omni

Reddit r/LocalLLaMA / 5/12/2026

💬 OpinionDeveloper Stack & InfrastructureTools & Practical UsageModels & Research

共有:

Key Points

The article shares a GitHub-based “Blackwell LLM Toolkit” that provides NVFP4 configurations, TensorRT-LLM setup, wheels, and benchmark scripts for running LLMs on NVIDIA Blackwell GPUs (e.g., RTX Pro 6000, 5090/5080/5070 Ti) as long as the models fit in memory.
It documents key “gotchas” for TensorRT-LLM (including enabling obscure launch flags for newer Mamba-hybrid models via a specific YAML config) to make inference work correctly.
For Blackwell memory constraints, it explains using LMCache with SSD offloading and rebuilding an LMCache PyPI wheel from source to fix missing sm_120 cubins crashes.
It includes research notes generated to clarify architectural differences across recent model families (Nemotron Omni V3, Qwen 3.5/3.6, Gemma 4), helping avoid misconfiguration traps such as confusing Qwen 3.5/3.6 with similarly named but different models.
Benchmark highlights report sustained decode speeds with single RTX Pro 6000 96GB (no tensor parallelism), including ~270 tok/s for Nemotron 3 Nano Omni V3 NVFP4 at 8k context and ~249 tok/s for the text-only Nemotron 3 Nano NVFP4 at 8k context.

Was trying to get a good set of models with NVFP4 to leverage the RTX Pro 6000 and was able to get across a few hurdles and have configs + wheels set up & ran benchmarks while i was at it. hopefully this helps some folks out.

This should work on all the Nvidia Blackwell cards. 5090, 5080, 5070ti etc. as long as the models fit. (like maybe stack 2x 5070TI's)

Anyhow, here's the repo of things:

https://github.com/elsung/blackwell-llm-toolkit

Gotchas & solutions

TRT-LLM launch flags
- Some obscure settings had to be enabled to make TensorRT-LLM run the newer Mamba-hybrid models. YAML file in the repo at `configs/trtllm/nemotron-omni-v3-sm120.yaml`.
LMCache
- Offloading context to SSD to make space for model on VRAM. The PyPI wheel was crashing on Blackwell (missing sm_120 cubins), so I rebuilt it from source. Works great on my Optane drive. Both the prebuilt wheel and the build script are in the repo.
Research docs
- AI-outputted deep-dives on what's actually different about the latest model families (Nemotron Omni V3, Qwen 3.5/3.6, Gemma 4). Helpful reference. The Qwen 3.5/3.6 one in particular saved me from a nasty trap — they look like renamed Qwen3-VL but are completely different architecture under the hood.

Benchmark highlights

single RTX Pro 6000 96GB, no tensor parallelism. Speed numbers are sustained decode tok/s (median of 3 runs, 500-token completions).

Nemotron-3-Nano-Omni V3 (multimodal — image/video/audio + text)

NVFP4 quant, tested at 8k context → **270 tok/s**
Fastest + handles all modalities. Needs TRT-LLM v1.3.0rc13.
https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4

Nemotron-3-Nano (text only)

NVFP4 quant, tested at 8k context → **249 tok/s**
Best for tool-calling agents (10/10 on tools).
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

DeepSeek-V4-Flash

IQ2_XXS-XL GGUF, tested at 65k context → **31 tok/s**
Best for complex reasoning (9/10 intel + 10/10 tools + 13/13 calibration).
https://huggingface.co/teamblobfish/DeepSeek-V4-Flash-GGUF (IQ2_XXS-XL)

MiniMax-M2.7-REAP-172B

Q3_K_S GGUF, tested at 196k context → **117 tok/s**
Long conversations.
https://huggingface.co/exdysa/MiniMax-M2.7-REAP-172B-A10B-GGUF (Q3_K_S)

MiniMax-M2.7 W4A16 (with LMCache → Optane SSD)

W4A16 AutoRound, tested at 154k context → **20-22 tok/s**
Long-ctx with W4A16-quality answers, KV cache offloaded to SSD.
https://huggingface.co/MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16

MiniMax-M2.7 W4A16 (short ctx, no LMCache)

Same model as above, tested at 64k context → **22-25 tok/s**
Highest-quality short answers (10/10 intel).

Full table with TTFT, prefill speeds, concurrency numbers, and all quality eval scores → bench/results.md in the repo: https://github.com/elsung/blackwell-llm-toolkit/blob/main/bench/results.md

Bench tools used to validate

`rapid_bench.py` — 41-prompt quality eval (10 intelligence + 10 tool-use + 13 calibration + 3 orchestration + 5 creative writing)
`bench_harness.py` — sustained decode + TTFT + prefill + concurrency, plus a `--prompt-tokens N` mode for the 154k long-ctx mjpansa runs

Apache 2.0, PRs welcome — especially benchmark contributions from other Blackwell GPU folks (RTX 5090/5080/5070TI) so the comparison fills out across different hardware.

submitted by /u/elsung
[link] [comments]

Black Hat USA

AI Business

I built a free AI tell detector after my own Reddit account got 2 'all comments are AI generated' callouts in one day

Dev.to

The $30/Month AI Coding Stack That Replaces $200 Subscriptions: A 2026 Setup Guide

Dev.to

FirstCall v0.1.0: a local-first API recipe workbench for agents

Dev.to

AI-Generated Code: Your Validation Checklist for Non-Developers

Dev.to

Blackwell LLM Toolkit - NVFP4 Config +Wheels + Benchmarks for Blackwell GPUs via TensorRT-LLM - 270 tk/s Nemotron 3 Omni

Key Points

Gotchas & solutions

Benchmark highlights

Bench tools used to validate

Related Articles

Black Hat USA

I built a free AI tell detector after my own Reddit account got 2 'all comments are AI generated' callouts in one day

The $30/Month AI Coding Stack That Replaces $200 Subscriptions: A 2026 Setup Guide

FirstCall v0.1.0: a local-first API recipe workbench for agents

AI-Generated Code: Your Validation Checklist for Non-Developers

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer