AI Navigate

Codebook Lossless LLM Compression: 10–25%+ RAM reduction with bitwise generic packing of indexed weights

Reddit r/LocalLLaMA / 3/14/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The post investigates how many unique values exist in LLM layer weights and finds that, while fp16 uses 16 bits, only about 12–13 bits are effectively needed, indicating redundancy that can be exploited.
  • By packing these indexed weights bitwise in a codebook-like scheme, the approach reduces RAM usage by about 10–25% at the cost of slower inference speed (speed roughly halved in tests).
  • The method has been tested on devices including a P2200 (5 GB) GPU and CPUs, with ongoing work to extend to a 32 GB MI50, and a lossy/balanced variant is also explored.
  • The work is presented as a narrative with a proof-of-concept codebase, inviting readers to review the repository and accompanying write-up, and it raises the idea of a new metric for model compactness.

So I asked myself a question (and then asked a coding model to build some pieces for me): when we talk about the values in a layer of an LLM, how many are actually unique? The answer led me down a couple of weeks of coding (yes, with Claude, Qwen, and Gemini).
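The question itself can be checked in a few lines. This is a toy sketch using random data as a stand-in for a layer; a real check would load an actual tensor from a model checkpoint (none of the names or sizes here come from the post):

```python
import numpy as np

# Toy stand-in for one fp16 layer; real code would load a checkpoint tensor.
rng = np.random.default_rng(0)
layer = rng.standard_normal(1_000_000).astype(np.float16)

# Count distinct values, then the index width needed to address them all.
unique_vals = np.unique(layer)
index_bits = int(np.ceil(np.log2(len(unique_vals))))

print(f"{len(unique_vals)} unique values -> "
      f"{index_bits}-bit indices vs 16-bit raw fp16 storage")
```

If the index width comes out well below 16, the layer is storing fewer distinct values than the format can represent, which is the redundancy the post exploits.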

fp16 is 16 bits, but most of the models I ran into really only use about 12-13 bits' worth of unique values. By packing those values into an indexed block, we can squeeze most of the models I tried down by 10-25%. Trading a bit of inference speed for size lets us fit models onto smaller cards (speed is roughly halved in my example test).
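A minimal sketch of the codebook idea, assuming numpy and a toy layer with far fewer distinct values than a real model (so the savings shown here are exaggerated). The layer's distinct values become the codebook; each weight is replaced by its index, and the indices are packed bitwise so no index wastes a full 16 bits:

```python
import numpy as np

def pack_indices(indices: np.ndarray, bits: int) -> np.ndarray:
    """Pack an array of small integers into a dense uint8 bitstream."""
    # Expand each index into its `bits` binary digits, then pack 8 per byte.
    bit_matrix = (indices[:, None] >> np.arange(bits - 1, -1, -1)) & 1
    return np.packbits(bit_matrix.astype(np.uint8).ravel())

def unpack_indices(packed: np.ndarray, bits: int, count: int) -> np.ndarray:
    """Inverse of pack_indices: recover `count` integers of width `bits`."""
    bit_array = np.unpackbits(packed)[: count * bits]
    weights = 1 << np.arange(bits - 1, -1, -1)
    return bit_array.reshape(count, bits) @ weights

# Toy "layer" with only 5 distinct fp16 values (a real layer has thousands).
rng = np.random.default_rng(0)
layer = rng.choice(np.float16([-0.5, -0.25, 0.0, 0.25, 0.5]), size=1000)

# Codebook = the distinct values; indices = each weight's position in it.
codebook, indices = np.unique(layer, return_inverse=True)
bits = max(1, int(np.ceil(np.log2(len(codebook)))))
packed = pack_indices(indices, bits)

# Lossless round trip: unpack indices, look values back up in the codebook.
restored = codebook[unpack_indices(packed, bits, layer.size)]
assert np.array_equal(restored, layer)

print(f"raw fp16: {layer.nbytes} bytes, "
      f"packed + codebook: {packed.nbytes + codebook.nbytes} bytes")
```

The decode step (unpack, then a codebook lookup) is extra work on every access, which is consistent with the roughly-halved inference speed the post reports.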

I've baked in a lossy/balanced version as well, but haven't tested it as much. Testing so far has been on my small P2200 (5 GB) card and on CPU, and I'm working on updates for my 32 GB MI50.

I'm also wondering if this might be a good way to measure the "compactness" of a model.
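One concrete way to phrase that metric (my framing, not the post's): the ratio of effective index bits to the format's nominal bits. A sketch, assuming numpy:

```python
import numpy as np

def compactness(layer: np.ndarray) -> float:
    """Hypothetical score: effective index bits / nominal storage bits.
    1.0 means the layer uses the format's full range of distinct values;
    lower means more redundancy for a codebook scheme to exploit."""
    nominal_bits = layer.dtype.itemsize * 8
    index_bits = np.ceil(np.log2(len(np.unique(layer))))
    return float(index_bits) / nominal_bits

# A layer with only 4 distinct fp16 values needs 2-bit indices: 2/16 = 0.125.
toy = np.float16([0.0, 0.25, 0.5, 0.75] * 100)
print(compactness(toy))  # -> 0.125
```

By this measure, the models in the post (12-13 effective bits in fp16) would score around 0.75-0.81.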

The article is my narrative of the journey (paywall removed), and here's the current proof-of-concept code: https://github.com/bigattichouse/Codebook-Quantization

submitted by /u/bigattichouse