On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks

arXiv cs.LG / 4/23/2026


Key Points

  • The paper studies how post-training quantization (PTQ) techniques—specifically GPTQ and a modified Hessian-Aware Quantization (HAWQ)—affect diffusion-based coding LLMs under low-bit settings.
  • Experiments compare a diffusion coding LLM (CoDA) against its auto-regressive counterpart (Qwen3-1.7B) using a standardized evaluation pipeline.
  • CoDA shows notably better robustness at very low bitwidths (2–4 bits), with smaller accuracy drops on HumanEval and MBPP than the auto-regressive model.
  • The authors report that mixed-precision configurations derived from HAWQ enable smoother trade-offs among accuracy, latency, and memory, supporting more efficient deployment.
  • Overall, the findings suggest diffusion LLMs may be more resilient to quantization, improving feasibility for cost- and memory-constrained inference.
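The mixed-precision idea behind HAWQ can be illustrated with a toy allocator: layers with larger Hessian-based sensitivity scores get more bits, subject to an average-bitwidth budget. This is a minimal sketch under assumed names and a simple greedy scheme, not the paper's or HAWQ's actual algorithm.

```python
def allocate_bits(sensitivities, bit_options=(2, 4, 8), budget_bits=4.0):
    """Toy HAWQ-style mixed-precision bit allocation (illustrative only).

    Greedily upgrades the most sensitive layers to higher bitwidths while
    keeping the average bitwidth at or below `budget_bits`. The greedy
    scheme and all names here are assumptions for illustration.
    """
    n = len(sensitivities)
    # Start every layer at the lowest available precision.
    bits = [min(bit_options)] * n
    # Visit layers from most to least sensitive.
    order = sorted(range(n), key=lambda i: -sensitivities[i])
    for b in sorted(bit_options)[1:]:  # try upgrading to 4 bits, then 8
        for i in order:
            trial = bits.copy()
            trial[i] = b
            if sum(trial) / n <= budget_bits:  # respect the average budget
                bits = trial
    return bits

# Most sensitive layers get 4 bits; the least sensitive stays at 2.
print(allocate_bits([0.9, 0.1, 0.5, 0.05], budget_bits=3.5))  # → [4, 4, 4, 2]
```

Trading the budget up or down moves layers between precisions, which is the kind of accuracy/latency/memory knob the key points describe.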

Abstract

Auto-regressive Large Language Models (LLMs) achieve strong performance on coding tasks but incur high memory and inference costs. Diffusion-based language models (d-LLMs) offer bounded inference cost via iterative denoising, but their behavior under post-training quantization (PTQ) has been sparsely explored. We investigate the application and robustness of PTQ techniques, specifically GPTQ and a modified Hessian-Aware Quantization (HAWQ) algorithm, on a diffusion-based coding LLM (CoDA) and its auto-regressive counterpart, Qwen3-1.7B, under a standardized evaluation pipeline. In our setup, CoDA exhibits greater robustness at low bitwidths (2–4 bits), with smaller accuracy degradation across the HumanEval and MBPP benchmarks. Additionally, mixed-precision configurations derived from HAWQ provide smooth trade-offs across accuracy, latency, and memory. These results suggest that diffusion LLMs may offer advantages for efficient deployment due to greater quantization resilience.
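To see why the 2–4 bit regime is where robustness differences emerge, consider the simplest PTQ baseline: uniform symmetric round-to-nearest weight quantization. This sketch is illustrative only; GPTQ additionally corrects rounding error layer by layer using second-order (Hessian) information, which is omitted here.

```python
import random

def quantize_symmetric(weights, bits):
    """Uniform symmetric round-to-nearest quantization of a weight list.

    Maps each weight to one of 2**bits signed integer levels and back,
    using a single per-tensor scale. A minimal PTQ baseline for
    illustration, not the method evaluated in the paper.
    """
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / qmax  # per-tensor scale factor
    def q(w):
        level = max(-qmax - 1, min(qmax, round(w / scale)))
        return level * scale                   # dequantize back to float
    return [q(w) for w in weights]

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(1024)]
for bits in (8, 4, 2):
    deq = quantize_symmetric(weights, bits)
    mse = sum((w - d) ** 2 for w, d in zip(weights, deq)) / len(weights)
    print(f"{bits}-bit MSE: {mse:.5f}")
```

The reconstruction error grows sharply as the bitwidth drops toward 2 bits, which is precisely where the paper reports CoDA degrading less than its auto-regressive counterpart.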