KLD for INTs and NVFP4s. AS ALWAYS - Use Case is important. Accuracy versus speed versus native kernels on your GPUs. Things to note again:
Qwen3.6-35B-A3B KLDs - INTs and NVFPs
Reddit r/LocalLLaMA / 4/26/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- The post discusses using KLD (Kullback–Leibler divergence) to compare quantized Qwen3.6-35B-A3B models, emphasizing that results depend heavily on the intended use case (accuracy vs speed vs GPU-native kernels).
- The author states their KLD calculation runs in vLLM on real logits, on-GPU, and takes only a few minutes on RTX 6000 GPUs, framing KLD as a mathematically exact measure of how far the quantized model's output distribution diverges from the reference (a rough sketch of a per-token KLD computation follows this list).
- It argues that KLD and downstream eval benchmarks can diverge: a quantization with worse KLD may still score better on a task-specific evaluation, so “bench maxing” is real and you should choose the quant that fits your scenario.
- The post compares quantization formats, noting that FP8 generally gives worse quality than INT8, that NVFP4 is described as “a lie” in practice, and that NVFP4 variants with higher activation precision (e.g., A16) can improve accuracy at a potential performance cost.
- Overall, the takeaway is to validate with both KLD-style divergence checks and use-case-specific evals, rather than relying on a single metric or theoretical speed expectations.
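The post does not include its measurement harness, and the summary above only describes it, so the following is a minimal illustrative sketch rather than the author's vLLM setup: run the same tokens through a full-precision reference model and a quantized variant, take the log-softmax of the real logits at every position, and average KL(reference ‖ quantized) over positions. The checkpoint names are placeholders, and real quantized checkpoints may need their own loading path.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint IDs; substitute the reference and quantized variants you compare.
REF_ID = "org/model-bf16"
QUANT_ID = "org/model-int8"

tok = AutoTokenizer.from_pretrained(REF_ID)
ref = AutoModelForCausalLM.from_pretrained(REF_ID, torch_dtype=torch.bfloat16).cuda().eval()
# Quantized checkpoints (compressed-tensors, bitsandbytes, etc.) may need their own loading kwargs.
qnt = AutoModelForCausalLM.from_pretrained(QUANT_ID).cuda().eval()

@torch.no_grad()
def mean_token_kld(text: str) -> float:
    """Average KL(reference || quantized) over the token positions of one prompt."""
    ids = tok(text, return_tensors="pt").input_ids.cuda()
    ref_logp = F.log_softmax(ref(ids).logits.float(), dim=-1)  # full-precision log-probs
    qnt_logp = F.log_softmax(qnt(ids).logits.float(), dim=-1)  # quantized log-probs
    # Pointwise p * (log p - log q), summed over the vocab, then averaged over positions.
    kld = F.kl_div(qnt_logp, ref_logp, log_target=True, reduction="none").sum(-1)
    return kld.mean().item()

print(mean_token_kld("The quick brown fox jumps over the lazy dog."))
```

In practice you would average this over a representative prompt set (ideally prompts from your own use case, per the post's advice) rather than a single sentence, and compare the resulting numbers across quantization formats.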


