TurboQuant for weights: near‑optimal 4‑bit LLM quantization with lossless 8‑bit residual – 3.2× memory savings

Reddit r/LocalLLaMA / 3/27/2026


Key Points

  • The article describes “TurboQuant for weights,” an adaptation of TurboQuant for compressing transformer weight matrices by providing a drop-in replacement for PyTorch’s `nn.Linear` layers.
  • It targets near-optimal 4-bit LLM quantization accuracy by adding a second 4-bit pass over the quantization residual (the “4+4 residual” configuration, 8 bits total), which recovers baseline accuracy.
  • Benchmarks on Qwen3.5-0.8B with WikiText-103 show that 4+4 residual (8 bits total) matches baseline bf16 perplexity (Δ PPL = 0.00) while reducing memory usage to 762 MB from 1,504 MB (about 2× savings).
  • Pure 4-bit configurations reduce model size further (361–381 MB) but incur higher perplexity degradation (Δ PPL between +1.94 and +2.28).
  • The post points readers to a GitHub repository for documentation, additional benchmarks, and Triton kernel implementation details.

TurboQuant for weights is an adaptation of the recent TurboQuant algorithm (Zandieh et al., 2025) from KV-cache quantization to model weight compression. It provides a drop-in replacement for `nn.Linear` with near-optimal distortion.
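To illustrate the “4+4 residual” idea, here is a minimal sketch of what such a drop-in `nn.Linear` replacement could look like. This is an assumption-laden toy (per-row absmax scales, int8 storage standing in for packed int4, no Triton kernels), not the repo’s actual implementation; the class name `QuantLinear` is hypothetical.

```python
import torch
import torch.nn as nn


class QuantLinear(nn.Module):
    """Toy 4-bit + 4-bit-residual linear layer (illustrative sketch only).

    First pass: 4-bit absmax quantization of the weights.
    Second pass: 4-bit quantization of the residual left by the first pass,
    which shrinks the per-element error by roughly another factor of ~14.
    """

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data.float()
        self.bias = linear.bias
        # Pass 1: one scale per output row, 4-bit levels in [-8, 7].
        self.scale1 = w.abs().amax(dim=1, keepdim=True) / 7
        self.q1 = torch.clamp((w / self.scale1).round(), -8, 7).to(torch.int8)
        # Pass 2: quantize the residual of pass 1 with its own 4-bit grid.
        r = w - self.q1.float() * self.scale1
        self.scale2 = r.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 7
        self.q2 = torch.clamp((r / self.scale2).round(), -8, 7).to(torch.int8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize on the fly: pass-1 reconstruction plus residual correction.
        w = self.q1.float() * self.scale1 + self.q2.float() * self.scale2
        return nn.functional.linear(x, w, self.bias)
```

A real kernel would pack two int4 codes per byte and fuse dequantization into the matmul; the point here is only that the residual pass makes the 8-bit total reconstruction far closer to the original weights than a single 4-bit pass.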

Benchmarks (Qwen3.5‑0.8B, WikiText‑103)

| Config             | Bits | PPL   | Δ PPL | Compressed Size |
|--------------------|------|-------|-------|-----------------|
| Baseline bf16      | 16   | 14.29 | –     | 1,504 MB        |
| 4+4 residual       | 8    | 14.29 | 0.00  | 762 MB          |
| 4-bit (group=full) | 4    | 16.23 | +1.94 | 361 MB          |
| 4-bit (group=128)  | 4    | 16.57 | +2.28 | 381 MB          |

Check the GitHub repo for full docs, benchmarks, and Triton kernel details.

submitted by /u/cksac