Hi everyone, I'm from Australia :) I just released a new research prototype. It's a lossless BF16 compression format that stores weights in 12 bits by replacing the 8-bit exponent with a 4-bit group code, using byte-aligned split storage: sign + mantissa take exactly 1 byte per element, so it's a true 12 bits per weight with no 16-bit padding waste and zero HBM read amplification. Yes, 12 bits, not 11! The main idea was not just to compress weights more, but to make the format GPU-friendly enough to use directly during inference.
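As a rough illustration of the layout (my own sketch, not the repo's actual code): assume each tensor block carries a shared base exponent, and the 4-bit group code stores each weight's exponent as an offset from that base, so decoding the exponent is a single integer ADD. The function names and the per-block `base_exp` parameter are assumptions for illustration.

```python
import numpy as np

def encode_bf16_12bit(weights_u16, base_exp):
    """Hypothetical sketch: split each BF16 word (as uint16) into a
    sign+mantissa byte and a 4-bit exponent group code.
    BF16 layout: bit 15 = sign, bits 14..7 = exponent, bits 6..0 = mantissa."""
    sign = (weights_u16 >> 15) & 0x1
    exp = (weights_u16 >> 7) & 0xFF
    mant = weights_u16 & 0x7F
    code = exp.astype(np.int32) - base_exp        # offset from the group base
    escapes = (code < 0) | (code > 15)            # out-of-window values escape
    byte = (sign << 7) | mant                     # 1 byte: sign + 7-bit mantissa
    return byte.astype(np.uint8), np.clip(code, 0, 15).astype(np.uint8), escapes

def decode_bf16_12bit(byte, code, base_exp):
    """Decode the exponent with one integer ADD (base + code)."""
    exp = base_exp + code.astype(np.uint16)       # the single ADD
    sign = (byte.astype(np.uint16) >> 7) & 0x1
    mant = byte.astype(np.uint16) & 0x7F
    return (sign << 15) | (exp << 7) | mant
```

Values whose exponent falls outside the 16-wide window would take the escape path; under the post's numbers that's ~0.03% of weights.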
Some results so far:
- Single-user (B=1), RTX 5070 Ti
- Multi-user (B=256), total tok/s
It also seems surprisingly stable across model types:
So far this is tested on BF16 safetensors only. Repo: https://github.com/cenconq25/Turbo-Lossless. Also worth noting: the V3 fused decode+GEMM kernel uses tensor-core patterns inspired by ZipServ / ZipGEMM (Fan et al., ASPLOS 2026). Happy to hear criticism, edge cases, or reasons this idea won't scale. Thanks for your time :) [link] [comments]
[P] GPU-friendly lossless 12-bit BF16 format with 0.03% escape rate and 1-integer-ADD decode, works on AMD & NVIDIA
Reddit r/MachineLearning / 4/4/2026
Key Points
- The post introduces a GPU-friendly, lossless BF16 weight compression prototype that stores each value in 12 bits by replacing the 8-bit exponent with a 4-bit group code.
- It claims bit-perfect reconstruction with a very low "escape rate" (about 0.03% of weights), meaning ~99.97% of values can be decoded with a single integer ADD operation.
- The format is designed to be byte-aligned and avoids entropy coding or bitstream parsing, enabling direct use during inference with a “fused decode + matmul” approach.
- Reported results on NVIDIA (e.g., RTX 5070 Ti) show inference throughput improvements over vLLM for several models, and the format is stated to work on both AMD and NVIDIA.
- Early experiments suggest the escape rate remains low and fairly stable across diverse model types, from Llama and Mixtral to SDXL and CogVideoX.
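The "byte-aligned" claim in the key points can be sanity-checked with a quick sketch: keep the sign+mantissa bytes as one contiguous plane and pack two 4-bit group codes per byte in a second plane, giving 1.5 bytes (12 bits) per weight versus 2 bytes for raw BF16, a 25% reduction with no bitstream parsing. The nibble layout (even index in the low nibble) is my assumption, not the repo's documented format.

```python
import numpy as np

def pack_codes(codes):
    """Pack two 4-bit group codes per byte.
    Assumed layout: even element in the low nibble, odd in the high nibble."""
    codes = np.asarray(codes, dtype=np.uint8)
    if len(codes) % 2:
        codes = np.append(codes, np.uint8(0))  # pad to an even count
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def unpack_codes(packed, n):
    """Inverse of pack_codes; n is the original element count."""
    out = np.empty(2 * len(packed), dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = (packed >> 4) & 0x0F
    return out[:n]

# Storage for n weights: n bytes (sign+mantissa plane) + ceil(n/2) bytes (codes)
# = 1.5 bytes/weight vs. 2 bytes/weight for raw BF16 -> 25% smaller.
```

Because each plane stays byte-addressable, a kernel can load both planes with ordinary coalesced reads, which is what makes the fused decode + matmul approach plausible.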