TurboQuant for weights: near‑optimal 4‑bit LLM quantization with lossless 8‑bit residual – 3.2× memory savings

Reddit r/LocalLLaMA / 3/27/2026


Key Points

  • The article describes “TurboQuant for weights,” an adaptation of TurboQuant for compressing transformer weight matrices by providing a drop-in replacement for PyTorch’s `nn.Linear` layers.
  • It targets near-optimal 4-bit LLM quantization accuracy by adding a second 4-bit pass over the quantization residual (the “4+4 residual” configuration, 8 bits total), which recovers baseline accuracy.
  • Benchmarks on Qwen3.5-0.8B with WikiText-103 show that 4+4 residual (8 bits total) matches baseline bf16 perplexity (Δ PPL = 0.00) while reducing memory usage to 762 MB from 1,504 MB (about 2× savings).
  • Pure 4-bit configurations reduce model size further (361–381 MB) but incur higher perplexity degradation (Δ PPL between +1.94 and +2.28).
  • The post points readers to a GitHub repository for documentation, additional benchmarks, and Triton kernel implementation details.

TurboQuant for weights is an adaptation of the recent TurboQuant algorithm (Zandieh et al., 2025) from KV-cache quantization to model weight compression. It provides a drop-in replacement for `nn.Linear` with near-optimal distortion.
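To illustrate the “4+4 residual” idea, here is a minimal sketch of what such a drop-in `nn.Linear` replacement could look like. This is an assumption-laden toy (per-row absmax scales, int8 storage standing in for packed int4, no Triton kernels), not the repo’s actual implementation; the class name `QuantLinear` is hypothetical.

```python
import torch
import torch.nn as nn


class QuantLinear(nn.Module):
    """Toy 4-bit + 4-bit-residual linear layer (illustrative sketch only).

    First pass: 4-bit absmax quantization of the weights.
    Second pass: 4-bit quantization of the residual left by the first pass,
    which shrinks the per-element error by roughly another factor of ~14.
    """

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data.float()
        self.bias = linear.bias
        # Pass 1: one scale per output row, 4-bit levels in [-8, 7].
        self.scale1 = w.abs().amax(dim=1, keepdim=True) / 7
        self.q1 = torch.clamp((w / self.scale1).round(), -8, 7).to(torch.int8)
        # Pass 2: quantize the residual of pass 1 with its own 4-bit grid.
        r = w - self.q1.float() * self.scale1
        self.scale2 = r.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 7
        self.q2 = torch.clamp((r / self.scale2).round(), -8, 7).to(torch.int8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize on the fly: pass-1 reconstruction plus residual correction.
        w = self.q1.float() * self.scale1 + self.q2.float() * self.scale2
        return nn.functional.linear(x, w, self.bias)
```

A real kernel would pack two int4 codes per byte and fuse dequantization into the matmul; the point here is only that the residual pass makes the 8-bit total reconstruction far closer to the original weights than a single 4-bit pass.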

Benchmarks (Qwen3.5‑0.8B, WikiText‑103)

| Config             | Bits | PPL   | Δ PPL | Compressed Size |
|--------------------|------|-------|-------|-----------------|
| Baseline bf16      | 16   | 14.29 | –     | 1,504 MB        |
| 4+4 residual       | 8    | 14.29 | 0.00  | 762 MB          |
| 4-bit (group=full) | 4    | 16.23 | +1.94 | 361 MB          |
| 4-bit (group=128)  | 4    | 16.57 | +2.28 | 381 MB          |

Check the GitHub repo for full docs, benchmarks, and Triton kernel details.

submitted by /u/cksac