Blackwell LLM Toolkit - NVFP4 Config + Wheels + Benchmarks for Blackwell GPUs via TensorRT-LLM - 270 tk/s Nemotron 3 Omni

Reddit r/LocalLLaMA / 5/12/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The article shares a GitHub-based “Blackwell LLM Toolkit” that provides NVFP4 configurations, TensorRT-LLM setup, wheels, and benchmark scripts for running LLMs on NVIDIA Blackwell GPUs (e.g., RTX Pro 6000, 5090/5080/5070 Ti) as long as the models fit in memory.
  • It documents key “gotchas” for TensorRT-LLM (including enabling obscure launch flags for newer Mamba-hybrid models via a specific YAML config) to make inference work correctly.
  • For Blackwell memory constraints, it explains using LMCache with SSD offloading and rebuilding the LMCache PyPI wheel from source to fix crashes caused by missing sm_120 cubins.
  • It includes AI-generated research notes clarifying architectural differences across recent model families (Nemotron Omni V3, Qwen 3.5/3.6, Gemma 4), helping avoid misconfiguration traps such as assuming Qwen 3.5/3.6 shares the Qwen3-VL architecture just because the names look similar.
  • Benchmark highlights report sustained decode speeds with single RTX Pro 6000 96GB (no tensor parallelism), including ~270 tok/s for Nemotron 3 Nano Omni V3 NVFP4 at 8k context and ~249 tok/s for the text-only Nemotron 3 Nano NVFP4 at 8k context.

I was trying to get a good set of NVFP4 models to leverage the RTX Pro 6000, got across a few hurdles, and ended up with configs + wheels set up, plus benchmarks I ran while I was at it. Hopefully this helps some folks out.

This should work on all the NVIDIA Blackwell cards (5090, 5080, 5070 Ti, etc.) as long as the models fit (or maybe stack 2x 5070 Tis).

Anyhow, here's the repo of things:

https://github.com/elsung/blackwell-llm-toolkit

Gotchas & solutions

  • TRT-LLM launch flags

    • Some obscure settings had to be enabled to make TensorRT-LLM run the newer Mamba-hybrid models; the YAML is in the repo at `configs/trtllm/nemotron-omni-v3-sm120.yaml`, and there's a hedged sketch of how such options get wired in right below this list.
  • LMCache

    • Offloads context (KV cache) to SSD to make room for the model in VRAM. The PyPI wheel was crashing on Blackwell (missing sm_120 cubins), so I rebuilt it from source; works great on my Optane drive. Both the prebuilt wheel and the build script are in the repo, and there's a quick sm_120 sanity check sketched after this list.
  • Research docs

    • AI-generated deep dives on what's actually different about the latest model families (Nemotron Omni V3, Qwen 3.5/3.6, Gemma 4). Helpful reference. The Qwen 3.5/3.6 one in particular saved me from a nasty trap: they look like renamed Qwen3-VL but are a completely different architecture under the hood.
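
A hedged sketch for the launch-flags bullet above, for anyone who prefers the Python LLM API over passing the extra LLM-API options YAML to `trtllm-serve`. This is NOT the repo's actual config; the model id and the specific kv-cache toggles are placeholders for the kind of options `configs/trtllm/nemotron-omni-v3-sm120.yaml` carries:

```python
# Sketch only: illustrative placeholders, not the repo's real settings.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="path-or-hf-id-of-your-nvfp4-checkpoint",  # placeholder
    kv_cache_config=KvCacheConfig(
        enable_block_reuse=False,        # example of a non-default toggle, not necessarily what the repo flips
        free_gpu_memory_fraction=0.85,   # leave some VRAM headroom
    ),
)

out = llm.generate(["Hello Blackwell"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)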

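Also not from the repo, but a quick way to sanity-check whether you're in sm_120 territory before blaming a model: compare what the card reports against what your torch build was compiled for (compiled extensions like LMCache's kernels have their own arch list baked in, usually controlled by `TORCH_CUDA_ARCH_LIST` at build time).

```python
# Quick diagnostic sketch: Blackwell reports compute capability (12, 0);
# if 'sm_120' is missing from the compiled arch list you'll see
# "no kernel image" / missing-cubin style crashes.
import torch

print(torch.cuda.get_device_name(0))        # e.g. an RTX PRO 6000 / 5090
print(torch.cuda.get_device_capability(0))  # (12, 0) on Blackwell
print(torch.cuda.get_arch_list())           # check that 'sm_120' is in here
```
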
Benchmark highlights

Single RTX Pro 6000 96GB, no tensor parallelism. Speed numbers are sustained decode tok/s (median of 3 runs, 500-token completions).

Nemotron-3-Nano-Omni V3 (multimodal — image/video/audio + text)

Nemotron-3-Nano (text only)

DeepSeek-V4-Flash

MiniMax-M2.7-REAP-172B

MiniMax-M2.7 W4A16 (with LMCache → Optane SSD)

MiniMax-M2.7 W4A16 (short ctx, no LMCache)

  • Same model as above, tested at 64k context → **22-25 tok/s**
  • Highest-quality short answers (10/10 on the intelligence eval).

Full table with TTFT, prefill speeds, concurrency numbers, and all quality eval scores → bench/results.md in the repo: https://github.com/elsung/blackwell-llm-toolkit/blob/main/bench/results.md

Bench tools used to validate

  • `rapid_bench.py` — 41-prompt quality eval (10 intelligence + 10 tool-use + 13 calibration + 3 orchestration + 5 creative writing)
  • `bench_harness.py` — sustained decode + TTFT + prefill + concurrency, plus a `--prompt-tokens N` mode for the 154k long-ctx mjpansa runs
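
If you just want to eyeball decode speed without pulling the repo, the gist of the measurement is simple. Rough, hypothetical version below (this is not `bench_harness.py`; the endpoint URL, model id, and the one-token-per-streamed-chunk assumption are all mine): the clock starts at the first streamed token, so TTFT/prefill is excluded, matching the "sustained decode" framing above.

```python
# Hedged sketch of a sustained-decode measurement against an OpenAI-compatible
# completions endpoint (e.g. a local trtllm-serve). Not the repo's harness.
import statistics
import time

import requests

URL = "http://localhost:8000/v1/completions"  # assumption: local OpenAI-compatible server
MODEL = "nemotron-3-nano-omni-v3-nvfp4"       # placeholder model id

def decode_tok_s(prompt: str, max_tokens: int = 500) -> float:
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens, "stream": True}
    t_first, n_chunks = None, 0
    with requests.post(URL, json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line or not line.startswith(b"data:"):
                continue
            if line.strip() == b"data: [DONE]":
                break
            n_chunks += 1                      # ~1 token per streamed chunk
            if t_first is None:
                t_first = time.perf_counter()  # start timing at the first token
    return (n_chunks - 1) / (time.perf_counter() - t_first)

runs = [decode_tok_s("Explain NVFP4 in one paragraph.") for _ in range(3)]
print(f"sustained decode ~ {statistics.median(runs):.1f} tok/s (median of 3 runs)")
```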

Apache 2.0, PRs welcome, especially benchmark contributions from other Blackwell GPU folks (RTX 5090/5080/5070 Ti) so the comparison fills out across different hardware.

submitted by /u/elsung