KLD for INTs and NVFP4s. AS ALWAYS - Use Case is important. Accuracy versus speed versus native kernels on your GPUs. Things to note again:
Qwen3.6-35B-A3B KLDs - INTs and NVFPs
Reddit r/LocalLLaMA / 4/26/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- The post discusses using KLD (Kullback–Leibler divergence) to compare quantized Qwen3.6-35B-A3B models, emphasizing that results depend heavily on the intended use case (accuracy vs speed vs GPU-native kernels).
- The author states their KLD calculation runs in vLLM on real logits, on-GPU, and takes only a few minutes on RTX 6000 GPUs, framing KLD as a mathematically exact measure of how far the quantized model's output distribution diverges from the reference (a rough sketch of a per-token KLD computation follows this list).
- It argues that KLD and downstream eval benchmarks can diverge: a quantization with worse KLD may still score better on a task-specific evaluation, so “bench maxing” is real and you should choose the quant that fits your scenario.
- The post compares quantization formats, noting that FP8 generally gives worse quality than INT8, that NVFP4 is described as “a lie” in practice, and that NVFP4 variants with higher activation precision (e.g., A16) can improve accuracy at a potential performance cost.
- Overall, the takeaway is to validate with both KLD-style divergence checks and use-case-specific evals, rather than relying on a single metric or theoretical speed expectations.
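The post does not include its measurement harness, and the summary above only describes it, so the following is a minimal illustrative sketch rather than the author's vLLM setup: run the same tokens through a full-precision reference model and a quantized variant, take the log-softmax of the real logits at every position, and average KL(reference ‖ quantized) over positions. The checkpoint names are placeholders, and real quantized checkpoints may need their own loading path.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint IDs; substitute the reference and quantized variants you compare.
REF_ID = "org/model-bf16"
QUANT_ID = "org/model-int8"

tok = AutoTokenizer.from_pretrained(REF_ID)
ref = AutoModelForCausalLM.from_pretrained(REF_ID, torch_dtype=torch.bfloat16).cuda().eval()
# Quantized checkpoints (compressed-tensors, bitsandbytes, etc.) may need their own loading kwargs.
qnt = AutoModelForCausalLM.from_pretrained(QUANT_ID).cuda().eval()

@torch.no_grad()
def mean_token_kld(text: str) -> float:
    """Average KL(reference || quantized) over the token positions of one prompt."""
    ids = tok(text, return_tensors="pt").input_ids.cuda()
    ref_logp = F.log_softmax(ref(ids).logits.float(), dim=-1)  # full-precision log-probs
    qnt_logp = F.log_softmax(qnt(ids).logits.float(), dim=-1)  # quantized log-probs
    # Pointwise p * (log p - log q), summed over the vocab, then averaged over positions.
    kld = F.kl_div(qnt_logp, ref_logp, log_target=True, reduction="none").sum(-1)
    return kld.mean().item()

print(mean_token_kld("The quick brown fox jumps over the lazy dog."))
```

In practice you would average this over a representative prompt set (ideally prompts from your own use case, per the post's advice) rather than a single sentence, and compare the resulting numbers across quantization formats.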


