
ggml : add NVFP4 quantization type support

Reddit r/LocalLLaMA / 3/13/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • It adds support for NVIDIA NVFP4 quantization in GGML/llama.cpp, introducing a new GGML_TYPE_NVFP4 and related block structures and conversion helpers.
  • The update extends convert_hf_to_gguf.py to detect NVFP4 ModelOpt models and repack them into the GGUF block format.
  • The CPU backend gains a scalar dot product plus an ARM NEON implementation, and tests were added for backend operations and quantization functions; the change was tested with NVFP4 models from Hugging Face, along with llama-bench and a basic server smoke test on an Apple M5.
  • Release is available from the b8297 tag, with a test model Qwen3-4B-NVFP4-GGUF provided for testing.

It's available from the b8297 tag onwards; grab the latest llama.cpp release.

This adds support for NVIDIA's NVFP4 quantization format (FP4 E2M1 weights, a UE4M3 per-block scale, 16 elements per block). This is the format produced by NVIDIA ModelOpt's NVFP4 algorithm. The main difference from the existing MXFP4 type is the scale encoding (UE4M3 instead of E8M0).
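As a rough sketch of the layout described above: one block holds a UE4M3 scale byte and 16 FP4 (E2M1) weights packed two per byte. The struct and helper names below are illustrative, not the actual ggml identifiers, and the UE4M3 decoding assumes OCP-style E4M3 semantics without the sign bit (bias 7, subnormals at exponent 0); the low-nibble-first packing order is also an assumption.

```c
#include <stdint.h>
#include <math.h>

#define NVFP4_BLOCK 16

// One NVFP4 block: a UE4M3 (unsigned E4M3) scale plus 16 FP4 (E2M1)
// weights packed two per byte. Names are hypothetical.
typedef struct {
    uint8_t scale;               // UE4M3: 4 exponent bits, 3 mantissa bits, no sign
    uint8_t qs[NVFP4_BLOCK / 2]; // 16 x E2M1 codes, two nibbles per byte
} block_nvfp4;

// The 8 magnitudes E2M1 can represent; the 4th bit of a code is the sign.
static const float e2m1_lut[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

static float fp4_e2m1_to_float(uint8_t v) {
    const float m = e2m1_lut[v & 0x07];
    return (v & 0x08) ? -m : m;
}

// UE4M3 decode, assuming OCP E4M3 conventions minus the sign bit:
// exponent bias 7, subnormals when the exponent field is 0.
static float ue4m3_to_float(uint8_t v) {
    const int   e = (v >> 3) & 0x0F;
    const float m = (float)(v & 0x07) / 8.0f;
    if (e == 0) return ldexpf(m, -6); // subnormal
    return ldexpf(1.0f + m, e - 7);
}

// Reference dequantization: decoded scale times each decoded nibble
// (low nibble first — packing order is an assumption).
static void dequantize_block_nvfp4(const block_nvfp4 *b, float *y) {
    const float d = ue4m3_to_float(b->scale);
    for (int i = 0; i < NVFP4_BLOCK / 2; i++) {
        y[2*i + 0] = d * fp4_e2m1_to_float(b->qs[i] & 0x0F);
        y[2*i + 1] = d * fp4_e2m1_to_float(b->qs[i] >> 4);
    }
}
```

Because the scale is a real FP8 value rather than a pure power of two (E8M0), the per-block multiplier carries mantissa bits, which is the key representational difference from MXFP4.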

What's in here:

  • New GGML_TYPE_NVFP4 type, block struct, UE4M3 conversion helpers, reference quantize/dequantize
  • convert_hf_to_gguf.py detects NVFP4 ModelOpt models and repacks into the GGUF block format
  • CPU backend: scalar dot product + ARM NEON
  • gguf-py: type constant, quant/dequant, endian conversion
  • Tests added to test-backend-ops and test-quantize-fns

Tested with models from https://huggingface.co/NVFP4 on an Apple M5 MacBook (CPU, NEON). Ran llama-bench and a basic server smoke test. Would appreciate help with benchmarking if someone has a good baseline to compare against.

Here is a Qwen3-4B model to test with.

submitted by /u/pmttyji