Fast NF4 Dequantization Kernels for Large Language Model Inference

arXiv cs.LG / 4/6/2026


Key Points

  • The paper targets a key LLM inference bottleneck: NF4 (4-bit NormalFloat) quantization cuts memory use, but current NVIDIA GPUs must dequantize weights back to FP16 at run time, which is costly.
  • It introduces lightweight NVIDIA kernel optimizations that exploit the GPU shared-memory hierarchy to speed up NF4 dequantization while preserving compatibility with the existing HuggingFace ecosystem.
  • Experiments show 2.0–2.2× kernel speedups versus BitsAndBytes across Gemma 27B, Qwen3 32B, and Llama3.3 70B, with up to 1.54× end-to-end improvements.
  • The approach reduces instruction count via simplified indexing logic and uses only 64 bytes of shared memory per thread block, emphasizing minimal engineering overhead for substantial gains (see the sketch after this list).
  • The authors position the method as a plug-and-play option that helps deploy larger models on current single-GPU infrastructure more efficiently.
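
To make the shared-memory idea concrete, the following is a minimal CUDA sketch of an NF4 dequantization kernel in the spirit the key points describe: a 16-entry codebook is staged in shared memory (16 × 4 B = 64 B per thread block, matching the figure above) and each packed byte expands into two FP16 weights. The kernel name, signature, nibble order, and per-block scale layout are illustrative assumptions, not the paper's actual implementation; the codebook constants are the published NF4 quantile grid from the QLoRA paper.

```cuda
#include <cuda_fp16.h>

// The published 16-entry NF4 quantile grid from the QLoRA paper; storing it
// as FP32 gives the 16 x 4 B = 64 B shared-memory footprint quoted above.
__constant__ float NF4_CODEBOOK[16] = {
    -1.0f,                 -0.6961928009986877f, -0.5250730514526367f,
    -0.39491748809814453f, -0.28444138169288635f, -0.18477343022823334f,
    -0.09105003625154495f,  0.0f,                 0.07958029955625534f,
     0.16093020141124725f,  0.24611230194568634f, 0.33791524171829224f,
     0.44070982933044434f,  0.5626170039176941f,  0.7229568362236023f,
     1.0f};

// Hypothetical kernel: dequantize packed NF4 weights (two 4-bit codes per
// byte) to FP16, scaling each quantization block by its absmax scale.
// Assumes blockDim.x >= 16 and an even quantization block size, so the two
// codes in one byte always share a scale.
__global__ void nf4_dequantize(const unsigned char* __restrict__ packed,
                               const float* __restrict__ scales,
                               __half* __restrict__ out,
                               int n_packed, int qblock_size) {
    // Stage the codebook in shared memory once per thread block.
    __shared__ float lut[16];
    if (threadIdx.x < 16) lut[threadIdx.x] = NF4_CODEBOOK[threadIdx.x];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_packed) return;

    // One shift and one mask recover both 4-bit indices, keeping the
    // indexing logic short, in the spirit of the simplification above.
    unsigned char byte = packed[i];
    float scale = scales[(2 * i) / qblock_size];
    out[2 * i]     = __float2half(lut[byte >> 4] * scale);
    out[2 * i + 1] = __float2half(lut[byte & 0x0F] * scale);
}
```

Because every thread reads the same tiny table, lookups are served from shared memory's low-latency banks rather than global memory, which is where the 12–15× latency advantage cited in the abstract below comes into play.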

Abstract

Large language models (LLMs) have grown beyond the memory capacity of single GPU devices, necessitating quantization techniques for practical deployment. While NF4 (4-bit NormalFloat) quantization enables 4× memory reduction, inference on current NVIDIA GPUs (e.g., Ampere A100) requires expensive dequantization back to FP16 format, creating a critical performance bottleneck. This paper presents a lightweight shared memory optimization that addresses this gap through principled memory hierarchy exploitation while maintaining full ecosystem compatibility. We compare our technique against the open-source BitsAndBytes implementation, achieving 2.0–2.2× kernel speedup across three models (Gemma 27B, Qwen3 32B, and Llama3.3 70B) and up to 1.54× end-to-end improvement by leveraging the 12–15× latency advantage of shared memory over global memory access. Our optimization reduces instruction counts through simplified indexing logic while using only 64 bytes of shared memory per thread block, demonstrating that lightweight optimizations can deliver substantial performance gains with minimal engineering effort. This work provides a plug-and-play solution for the HuggingFace ecosystem that democratizes access to advanced models on existing GPU infrastructure.
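
As a back-of-the-envelope check on two of the abstract's figures, assuming the 64-byte footprint corresponds to a 16-entry codebook stored in FP32:

$$
\underbrace{16}_{\text{NF4 codes}} \times \underbrace{4\,\text{B}}_{\text{FP32}} = 64\,\text{B per thread block},
\qquad
\frac{16\,\text{bits (FP16)}}{4\,\text{bits (NF4)}} = 4\times \text{ memory reduction}.
$$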