Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization

arXiv cs.CL / 4/10/2026



Key Points

The paper finds that in extreme 2-bit additive quantization of LLMs, catastrophic failures stem primarily from poor codebook initialization, not from insufficient search or fine-tuning.
It shows greedy sequential initialization often lands the model in bad optimization regions that beam search and PV-tuning cannot reliably recover from, especially at tighter compression rates.
The authors analyse this behaviour through the representational ratio (ρ̂ = N/KM), which relates the number of weight groups to codebook capacity, and show that the severity of the initialization bottleneck scales with it: moderate at 3 bpp, extreme at 2 bpp.
They propose OA-EM, an output-aware EM initialization method based on a Hessian-weighted Mahalanobis distance; it consistently yields better quantized-model quality after PV-tuning.
Across multiple architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B) and compression settings, OA-EM improves the quality-compute tradeoff and can prevent perplexity from degrading by orders of magnitude at 2 bpp.
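
To make the "output-aware" idea concrete, here is a minimal sketch of the assignment step of a clustering-style initialisation that measures distance with a Hessian-weighted quadratic form, d(w, c) = (w - c)^T H (w - c), instead of plain Euclidean distance. It is an illustration only, not the authors' OA-EM implementation: the function names `mahalanobis_sq` and `assign`, the toy matrix `H`, and the toy codebook are all assumptions made for this example.

```python
# Sketch only: Hessian-weighted (Mahalanobis-style) code assignment.
# All names and numbers here are hypothetical, not taken from the paper.

def mahalanobis_sq(w, c, H):
    """Squared distance (w - c)^T H (w - c) for a symmetric PSD matrix H."""
    d = [wi - ci for wi, ci in zip(w, c)]
    n = len(d)
    return sum(d[i] * H[i][j] * d[j] for i in range(n) for j in range(n))


def assign(groups, codebook, H):
    """For each weight group, return the index of the closest code under H."""
    return [
        min(range(len(codebook)), key=lambda k: mahalanobis_sq(w, codebook[k], H))
        for w in groups
    ]


# Toy setup: the first coordinate is assumed to matter 100x more for the output.
H = [[100.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]   # reduces to plain Euclidean distance
codebook = [[1.0, 3.0], [0.0, 1.0]]
groups = [[1.0, 1.0]]

euclidean_choice = assign(groups, codebook, identity)
hessian_choice = assign(groups, codebook, H)
```

In this toy case the Euclidean metric picks code 1 (squared distance 1 versus 4), while the Hessian-weighted metric picks code 0 (4 versus 100), because an error in the first coordinate is penalised far more heavily. A full EM-style procedure would alternate this assignment with a codebook update; that step is omitted here.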

Abstract

Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and fine-tuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
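
As a rough illustration of the two quantities the abstract relies on, the sketch below shows lookup-table dequantization in an additive scheme (a weight group is reconstructed as the sum of one vector from each codebook) and the ratio N/KM. The abstract does not define the symbols, so the reading used here is an assumption: N is taken as the number of weight groups, K as the number of entries per codebook, and M as the number of codebooks. The function names and toy values are hypothetical.

```python
# Sketch only: additive-quantization decoding and the ratio N / (K * M).
# The interpretation of N, K and M is an assumption, not stated in the abstract.

def dequantize_group(codes, codebooks):
    """Rebuild one weight group by summing one looked-up vector per codebook.

    codes:     M integer indices, one per codebook
    codebooks: M codebooks, each a list of K vectors of equal length
    """
    group_size = len(codebooks[0][0])
    out = [0.0] * group_size
    for m, idx in enumerate(codes):
        vec = codebooks[m][idx]          # a single table lookup per codebook
        for j in range(group_size):
            out[j] += vec[j]
    return out


def representational_ratio(n_groups, k_entries, m_codebooks):
    """Ratio of weight groups to total codebook entries, N / (K * M)."""
    return n_groups / (k_entries * m_codebooks)


# Toy example: M = 2 codebooks, K = 2 entries each, groups of 2 weights.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [-0.5, -0.5]],
]
reconstructed = dequantize_group([1, 0], codebooks)   # [0.0, 1.0] + [0.5, 0.5]
rho = representational_ratio(1024, 256, 2)
```

Under this reading, a larger ratio means more weight groups must share the same codebook entries, which is consistent with the abstract's claim that the bottleneck worsens as ρ grows.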


