
ggml : add NVFP4 quantization type support

Reddit r/LocalLLaMA / 3/13/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • It adds support for NVIDIA NVFP4 quantization in GGML/llama.cpp, introducing a new GGML_TYPE_NVFP4 and related block structures and conversion helpers.
  • The update extends convert_hf_to_gguf.py to detect NVFP4 ModelOpt models and repack them into the GGUF block format.
  • The CPU backend gains a scalar dot product plus an ARM NEON implementation, and tests were added for backend operations and quantization functions; the change was tested with NVFP4 models from Hugging Face, along with llama-bench and a basic server smoke test on an Apple M5.
  • Release is available from the b8297 tag, with a test model Qwen3-4B-NVFP4-GGUF provided for testing.

It's available from the b8297 tag onwards; grab the latest llama.cpp release.

This adds support for NVIDIA's NVFP4 quantization format (FP4 E2M1 weights, a UE4M3 per-block scale, 16 elements per block). This is the format produced by NVIDIA ModelOpt's NVFP4 algorithm. The main difference from the existing MXFP4 type is the scale encoding (UE4M3 instead of E8M0).
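As a rough sketch of the layout described above: one block holds a UE4M3 scale byte and 16 FP4 (E2M1) weights packed two per byte. The struct and helper names below are illustrative, not the actual ggml identifiers, and the UE4M3 decoding assumes OCP-style E4M3 semantics without the sign bit (bias 7, subnormals at exponent 0); the low-nibble-first packing order is also an assumption.

```c
#include <stdint.h>
#include <math.h>

#define NVFP4_BLOCK 16

// One NVFP4 block: a UE4M3 (unsigned E4M3) scale plus 16 FP4 (E2M1)
// weights packed two per byte. Names are hypothetical.
typedef struct {
    uint8_t scale;               // UE4M3: 4 exponent bits, 3 mantissa bits, no sign
    uint8_t qs[NVFP4_BLOCK / 2]; // 16 x E2M1 codes, two nibbles per byte
} block_nvfp4;

// The 8 magnitudes E2M1 can represent; the 4th bit of a code is the sign.
static const float e2m1_lut[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

static float fp4_e2m1_to_float(uint8_t v) {
    const float m = e2m1_lut[v & 0x07];
    return (v & 0x08) ? -m : m;
}

// UE4M3 decode, assuming OCP E4M3 conventions minus the sign bit:
// exponent bias 7, subnormals when the exponent field is 0.
static float ue4m3_to_float(uint8_t v) {
    const int   e = (v >> 3) & 0x0F;
    const float m = (float)(v & 0x07) / 8.0f;
    if (e == 0) return ldexpf(m, -6); // subnormal
    return ldexpf(1.0f + m, e - 7);
}

// Reference dequantization: decoded scale times each decoded nibble
// (low nibble first — packing order is an assumption).
static void dequantize_block_nvfp4(const block_nvfp4 *b, float *y) {
    const float d = ue4m3_to_float(b->scale);
    for (int i = 0; i < NVFP4_BLOCK / 2; i++) {
        y[2*i + 0] = d * fp4_e2m1_to_float(b->qs[i] & 0x0F);
        y[2*i + 1] = d * fp4_e2m1_to_float(b->qs[i] >> 4);
    }
}
```

Because the scale is a real FP8 value rather than a pure power of two (E8M0), the per-block multiplier carries mantissa bits, which is the key representational difference from MXFP4.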

What's in here:

  • New GGML_TYPE_NVFP4 type, block struct, UE4M3 conversion helpers, reference quantize/dequantize
  • convert_hf_to_gguf.py detects NVFP4 ModelOpt models and repacks into the GGUF block format
  • CPU backend: scalar dot product + ARM NEON
  • gguf-py: type constant, quant/dequant, endian conversion
  • Tests added to test-backend-ops and test-quantize-fns

Tested with models from https://huggingface.co/NVFP4 on an Apple M5 MacBook (CPU, NEON). Ran llama-bench and a basic server smoke test. Would appreciate help with benchmarking if someone has a good baseline to compare against.

Here is a Qwen3-4B model to test with.

submitted by /u/pmttyji