Nemotron 3 Super - large quality difference between llama.cpp and vLLM?

Reddit r/LocalLLaMA / 3/29/2026

💬 OpinionDeveloper Stack & InfrastructureSignals & Early TrendsIdeas & Deep Analysis

Key Points

  • The author compares Nemotron 3 Super performance on the same private reasoning/knowledge benchmark across vLLM and llama.cpp and finds vLLM scores much higher (55.4% vs 40.2%).
  • The benchmark setup disables “thinking,” uses temperature 0.7, and keeps other parameters near defaults, suggesting the gap may not be driven by temperature/top-p choices.
  • They suspect it is not inherent to the NVFP4 format because similar patterns appear with other quantizations (Q4/Q8/F16 in general) and with a different quant variant (~40% on llama.cpp).
  • A cross-model sanity check with Gemma 3 27B shows nearly identical results between llama.cpp and vLLM, implying the discrepancy may be specific to Nemotron 3 Super or to how it is handled.
  • The author asks whether there are additional generation/runtime parameters or implementation differences between vLLM and llama.cpp that could explain the large quality delta, noting they did not see changes across newer llama.cpp versions.

Hey all,

I have a private knowledge/reasoning benchmark I like to use for evaluating models. It's a bit over 400 questions, intended for non-thinking modes, programatically scored. It seems to correlate quite well with the model's quality, at least for my usecases. Smaller models (24-32B) tend to score ~40%, larger ones (70B dense or somewhat larger MoEs) often score ~50%, and the largest ones I can run (Devstral 2/low quants of GLM 4.5-7) get up to ~60%.

On launch of Nemotron 3 Super it seemed llama.cpp support was not instantly there, so I thought I'd try vLLM to run the NVFP4 version. It did surprisingly well on the test: 55.4% with 10 attempts per question. Similar score to GPT-OSS-120B (medium/high effort). But, running the model on llama.cpp, it does far worse: 40.2% with 20 attempts per question (unsloth Q4_K_XL).

My logs for either one look relatively "normal." Obviously more errors with the gguf (and slightly shorter responses on average), but it was producing coherent text. The benchmark script passes {"enable_thinking": false} either way to disable thinking, sets temperature 0.7, and otherwise leaves most parameters about default. I reran the test in llama.cpp with nvidia's recommended temperature 1.0 and saw no difference. In general, I haven't found temperature to have a significant impact on this test. They also recommend top-p 0.95 but that seems to be the default anyways.

I generally see almost no significant difference between Q4_*, Q8_0, and F16 ggufs, so I doubt there could be any inherent "magic" to NVFP4 making it do this much better. Also tried bartowski's Q4_K_M quant and got a similar ~40% score.

Fairly basic launch commands, something like: vllm serve "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" --port 8080 --trust-remote-code --gpu-memory-utilization 0.85 and llama-server -c (whatever) -m NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL.gguf.

So, the question: Is there some big difference in other generation parameters between these I'm missing that might be causing this, or another explanation? I sat on this for a bit in case there was a bug in initial implementations but not seeing any changes with newer versions of llama.cpp.

I tried a different model to narrow things down:

  • koboldcpp, gemma 3 27B Q8: 40.2%
  • llama.cpp, gemma 3 27B Q8: 40.6%
  • vLLM, gemma 3 27B F16: 40.0%

Pretty much indistinguishable. 5 attempts/question for each set here, and the sort of thing I'd expect to see.

Using vllm 0.17.1, llama.cpp 8522.

submitted by /u/BigStupidJellyfish_
[link] [comments]