llama.cpp benchmark native vs. non-native NVFP4 on Blackwell - summary

Reddit r/LocalLLaMA / 4/29/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research

Key Points

  • The test compares two llama.cpp builds (b8966 without native NVFP4 support vs. b8967 with native NVFP4 support) using the same Qwen3.6-27B-NVFP4 model and identical CUDA settings on an RTX 5090 system.
  • Native NVFP4 in b8967 significantly boosts prompt processing (prompt ingestion) performance by roughly 43–68%, with an average uplift around 57%.
  • Token generation speed is effectively unchanged between the two builds, so the improvement mainly affects time-to-first-token and handling of long inputs.
  • The gains are expected to be largest for long prompts, large-context workloads, RAG/document analysis, and code-heavy prompts where prompt processing dominates overall latency.
  • The article notes a mismatch between the label reported by llama-bench and the actual model tested, emphasizing that the results are specifically for Qwen3.6-27B-NVFP4.

I tested two llama.cpp builds on the same Qwen3.6-27B-NVFP4 model.

llama-bench reports the model label as qwen35 27B NVFP4, but the actual tested model is Qwen3.6-27B-NVFP4.

Test platform

  • GPU: NVIDIA GeForce RTX 5090
  • CPU: AMD Ryzen 9 9950X3D
  • RAM: 128 GB DDR5 5600 CL36
  • Backend: CUDA

Tested builds

  • b8966 — last build without native NVFP4 support
  • b8967 — first build with native NVFP4 support

Both runs used the same model and settings: Qwen3.6-27B-NVFP4, 17.50 GiB, 26.90B parameters, CUDA backend, ngl=999, fa=1.
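
The post does not include the exact command line, so the following reproduction sketch is an assumption: the binary paths, model filename, and flag spellings are guesses, and only the numeric settings (ngl=999, fa=1, the pp/tg sizes, and the context depths) are taken from the settings and tables in the post.

```python
# Hypothetical reproduction script for the two runs. Paths, the model filename,
# and flag spellings are assumptions; only the numeric settings come from the post.
import subprocess

MODEL = "Qwen3.6-27B-NVFP4.gguf"  # assumed filename
BUILDS = {
    "b8966": "./b8966/bin/llama-bench",  # last build without native NVFP4
    "b8967": "./b8967/bin/llama-bench",  # first build with native NVFP4
}

for tag, binary in BUILDS.items():
    cmd = [
        binary,
        "-m", MODEL,
        "-ngl", "999",                    # offload all layers to the GPU
        "-fa", "1",                       # flash attention enabled
        "-p", "512,2048",                 # prompt-processing tests pp512 / pp2048
        "-n", "128,512",                  # generation tests tg128 / tg512
        "-d", "0,4096,8192,16384,32768",  # context depths ("@ dN" in the tables)
    ]
    print(f"== {tag} ==")
    subprocess.run(cmd, check=True)
```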

Main conclusion

Native NVFP4 support in b8967 significantly improves prompt processing / prompt ingestion performance, but it does not meaningfully change token generation speed.

In practical terms:

  • prompt processing is around 43–68% faster with native NVFP4,
  • average prompt processing uplift is roughly 57%,
  • token generation remains effectively unchanged,
  • long prompts, large contexts, RAG workloads, document analysis, and code-heavy prompts should benefit the most,
  • normal chat generation speed will feel mostly the same once generation has started.

Prompt processing results

Test | b8966 — no native NVFP4 | b8967 — native NVFP4 | Improvement
pp512 | 3295.10 t/s | 5546.93 t/s | +68.3%
pp2048 | 3373.30 t/s | 5594.58 t/s | +65.8%
pp512 @ d4096 | 3265.74 t/s | 5232.92 t/s | +60.2%
pp2048 @ d4096 | 3231.69 t/s | 5272.82 t/s | +63.2%
pp512 @ d8192 | 3152.71 t/s | 4995.34 t/s | +58.4%
pp2048 @ d8192 | 3117.80 t/s | 5005.44 t/s | +60.5%
pp512 @ d16384 | 2965.81 t/s | 4537.54 t/s | +53.0%
pp2048 @ d16384 | 2934.26 t/s | 4547.25 t/s | +55.0%
pp512 @ d32768 | 2514.70 t/s | 3586.58 t/s | +42.6%
pp2048 @ d32768 | 2479.39 t/s | 3560.58 t/s | +43.6%

The native NVFP4 build is consistently much faster during prefill. The largest gains appear at shorter and medium context sizes, where b8967 is roughly 1.6×–1.7× faster than b8966. At very long context, such as d32768, the advantage decreases but is still substantial at around 1.43× faster.
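
As a sanity check, the ~57% average uplift quoted earlier can be re-derived directly from the table above; the snippet below only recomputes the per-test improvement and its mean from the numbers already shown, nothing is newly measured.

```python
# Recompute per-test uplift and the average from the prompt-processing table.
pp = {  # test: (b8966 t/s, b8967 t/s), copied from the table above
    "pp512":            (3295.10, 5546.93),
    "pp2048":           (3373.30, 5594.58),
    "pp512 @ d4096":    (3265.74, 5232.92),
    "pp2048 @ d4096":   (3231.69, 5272.82),
    "pp512 @ d8192":    (3152.71, 4995.34),
    "pp2048 @ d8192":   (3117.80, 5005.44),
    "pp512 @ d16384":   (2965.81, 4537.54),
    "pp2048 @ d16384":  (2934.26, 4547.25),
    "pp512 @ d32768":   (2514.70, 3586.58),
    "pp2048 @ d32768":  (2479.39, 3560.58),
}

uplifts = [(new / old - 1.0) * 100 for old, new in pp.values()]
for name, pct in zip(pp, uplifts):
    print(f"{name:18s} +{pct:.1f}%")
print(f"average uplift: +{sum(uplifts) / len(uplifts):.1f}%")  # ≈ +57.1%
```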

Token generation results

Test | b8966 — no native NVFP4 | b8967 — native NVFP4 | Difference
tg128 | 73.73 t/s | 73.62 t/s | -0.1%
tg512 | 73.71 t/s | 73.68 t/s | ~0.0%
tg128 @ d4096 | 72.60 t/s | 72.47 t/s | -0.2%
tg512 @ d4096 | 72.47 t/s | 72.50 t/s | +0.0%
tg128 @ d8192 | 71.70 t/s | 71.57 t/s | -0.2%
tg512 @ d8192 | 71.65 t/s | 71.61 t/s | -0.1%
tg128 @ d16384 | 70.10 t/s | 70.04 t/s | -0.1%
tg512 @ d16384 | 70.08 t/s | 69.90 t/s | -0.3%
tg128 @ d32768 | 67.00 t/s | 66.88 t/s | -0.2%
tg512 @ d32768 | 66.98 t/s | 66.98 t/s | 0.0%

Token generation performance is essentially identical between the two builds. The tiny differences are within normal benchmark noise.

This means native NVFP4 support improves the prefill path, but does not noticeably speed up autoregressive decoding. That is the expected pattern if prefill is compute-bound (so faster low-precision matrix multiplication helps), while single-stream decoding is mostly limited by memory bandwidth, which is the same for both builds since they read the same NVFP4 weights.

Context length behavior

Both builds show a gradual slowdown as context length increases. For token generation, the drop is nearly identical:

Context | b8966 tg512 | b8967 tg512
base | 73.71 t/s | 73.68 t/s
d4096 | 72.47 t/s | 72.50 t/s
d8192 | 71.65 t/s | 71.61 t/s
d16384 | 70.08 t/s | 69.90 t/s
d32768 | 66.98 t/s | 66.98 t/s

Going from the base test to d32768, generation speed drops from about 73.7 t/s to 67.0 t/s, which is only around a 9% decrease. That is a healthy result for a 27B model at long context.

For prompt processing, b8967 remains much faster across the whole range, but the relative advantage shrinks at very long context sizes:

Context | b8966 pp2048 | b8967 pp2048 | Improvement
base | 3373.30 t/s | 5594.58 t/s | +65.8%
d4096 | 3231.69 t/s | 5272.82 t/s | +63.2%
d8192 | 3117.80 t/s | 5005.44 t/s | +60.5%
d16384 | 2934.26 t/s | 4547.25 t/s | +55.0%
d32768 | 2479.39 t/s | 3560.58 t/s | +43.6%
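
Putting the two tables above together, a short script can show both trends at once: how little tg512 degrades with depth, and how the native NVFP4 prefill advantage narrows. The numbers are copied from the tables; nothing here is newly measured.

```python
# Context-length scaling from the two tables above (values copied verbatim).
depths   = ["base", "d4096", "d8192", "d16384", "d32768"]
tg_b8967 = [73.68, 72.50, 71.61, 69.90, 66.98]                   # b8967 tg512, t/s
pp_b8966 = [3373.30, 3231.69, 3117.80, 2934.26, 2479.39]         # pp2048, t/s
pp_b8967 = [5594.58, 5272.82, 5005.44, 4547.25, 3560.58]         # pp2048, t/s

for d, tg, old, new in zip(depths, tg_b8967, pp_b8966, pp_b8967):
    tg_drop = (1 - tg / tg_b8967[0]) * 100   # slowdown vs. base depth
    speedup = new / old                      # native-NVFP4 prefill advantage
    print(f"{d:7s}  tg512 drop vs base: {tg_drop:4.1f}%   pp2048 speedup: {speedup:.2f}x")

# At d32768, generation is only ~9% slower than at base depth, while the
# native-NVFP4 prefill advantage narrows from ~1.66x to ~1.44x.
```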

Final takeaway

b8967 with native NVFP4 support is clearly better than b8966 for Qwen3.6-27B-NVFP4 on an RTX 5090 system.

It delivers a major prompt processing improvement — roughly 1.4× to 1.7× faster prefill — while keeping token generation speed effectively unchanged.

So the practical benefit is not “higher tokens per second while generating,” but rather much faster prompt ingestion, lower time-to-first-token for large prompts, and better usability with long-context workloads.
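
To make the time-to-first-token point concrete, here is a rough back-of-the-envelope estimate using the measured pp2048 @ d32768 rates from the tables above. The 32k-token prompt is a hypothetical example, and real prefill speed varies as the context fills, so treat this as an order-of-magnitude sketch rather than a measurement.

```python
# Rough prefill (time-to-first-token) estimate for a hypothetical 32k-token prompt,
# using the measured pp2048 @ d32768 rates from the tables above.
PROMPT_TOKENS = 32_768  # hypothetical large RAG / document-analysis prompt

pp_rate = {"b8966": 2479.39, "b8967": 3560.58}  # t/s

for build, rate in pp_rate.items():
    print(f"{build}: ~{PROMPT_TOKENS / rate:.1f} s to ingest the prompt")

# b8966: ~13.2 s, b8967: ~9.2 s, i.e. roughly 4 seconds less waiting before the
# first generated token, while generation speed afterwards is unchanged.
```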

submitted by /u/mossy_troll_84