FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8

Reddit r/LocalLLaMA / 5/5/2026


Key Points

  • Researchers previously introduced Dynamic Memory Sparsification (DMS) to compress KV-cache using learned per-head token eviction, and the author reports a near-lossless replication with ~6.4× KV-cache compression on Llama 3.2 1B.
  • To address a major bottleneck in the Hugging Face reference implementation (about 18 tok/s), the author developed “FastDMS,” an MIT-licensed implementation that uses compact KV storage and physically reclaims evicted slots.
  • FastDMS is tested against NVIDIA’s original Qwen 3 8B DMS checkpoint and the author’s own Llama 3.2 1B DMS checkpoint, with the repository including both the original HF reference and the trainer.
  • In the author’s benchmarks, FastDMS reduces KV memory usage by roughly 5–8× versus vLLM BF16 KV at 8K context and decodes about 1.5–2× faster than vLLM.
  • The author emphasizes that the savings are not just theoretical KV-byte reductions, but translate into real allocator/device memory savings under the tested workload configuration (e.g., ctx_len=8192, gen_len=128).

Last year, researchers affiliated with NVIDIA, the University of Warsaw, and the University of Edinburgh published Dynamic Memory Sparsification (DMS), a KV-cache sparsification technique that uses learned per-head token eviction, reporting up to 8x KV-cache compression.
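The core idea is that each KV head learns a small predictor that scores tokens for keep/evict, and evicted tokens drop out of attention entirely (with a sliding retention window so tokens are only evicted after a delay). Below is a toy sketch of just the scoring step; the shapes, the shared linear scorer, and the 0.5 threshold are illustrative only and not the paper's exact predictor or training objective.

```python
import torch

# Toy sketch of learned per-head token eviction scoring (illustrative only;
# the real DMS predictor, its inputs, and the training objective are in the
# paper and the repo). Assumed shape: keys is [num_heads, seq_len, head_dim].
def eviction_keep_probs(keys: torch.Tensor, predictor: torch.nn.Linear) -> torch.Tensor:
    logits = predictor(keys).squeeze(-1)   # [num_heads, seq_len], one score per (head, token)
    return torch.sigmoid(logits)           # keep probability in [0, 1]

num_heads, seq_len, head_dim = 8, 1024, 64
keys = torch.randn(num_heads, seq_len, head_dim)
predictor = torch.nn.Linear(head_dim, 1)   # hypothetical scorer, applied to each head's keys

keep_mask = eviction_keep_probs(keys, predictor) > 0.5   # hard keep/evict decision at inference
live = keep_mask.sum(dim=-1).float().mean().item()
print(f"average compression ~ {seq_len / live:.1f}x per head")
```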

I found the results intriguing enough to build a small reference implementation and trainer to sanity-check the idea. On WikiText-2 with Llama 3.2 1B, I was able to get a rough replication:

| Configuration | PPL | Delta | KLD (nats/tok) | Compression |
|---|---|---|---|---|
| Vanilla Llama-3.2-1B | 9.226 | - | - | 1x |
| DMS (trained, eviction active) | 9.200 | -0.28% | 0.026 | 6.4x |

Training the DMS predictors took about 20 minutes on the PRO 6000 and the compression looked basically lossless. One small problem, though: my HF reference implementation ran at about... 18 tok/s.

So, after a few weeks of kernel grinding, I'm pleased to announce FastDMS, an MIT-licensed implementation of DMS with compact KV storage that physically reclaims evicted slots. It is tested on NVIDIA's original Qwen 3 8B DMS checkpoint as well as my own Llama 3.2 1B DMS checkpoint (the original HF reference version and my trainer are in the repo as well): https://github.com/shisa-ai/FastDMS
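The "compact" part is the important bit: instead of just masking evicted tokens (which keeps their memory allocated), surviving K/V entries get gathered into a smaller buffer so the slots are actually reclaimed. A minimal Python-level sketch of that operation; FastDMS does this in fused kernels over its own compact storage, so this only shows the shape of the idea.

```python
import torch

def compact_kv(keys, values, keep_mask):
    """Physically reclaim evicted slots by gathering surviving tokens per head.

    keys/values: [num_heads, seq_len, head_dim]; keep_mask: [num_heads, seq_len] bool.
    Returns per-head lists, since the surviving token count differs per head.
    Illustrative only - the real kernels stream into pre-sized compact storage."""
    compact_k, compact_v = [], []
    for h in range(keys.shape[0]):
        idx = keep_mask[h].nonzero(as_tuple=True)[0]   # surviving token positions
        compact_k.append(keys[h, idx])                 # [live_h, head_dim]
        compact_v.append(values[h, idx])
    return compact_k, compact_v
```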

On my benchmark setup, FastDMS uses 5-8x less KV memory than vLLM BF16 KV at 8K context while also decoding 1.5-2X faster than vLLM.

Compact DMS saves real allocator/device memory, not just theoretical KV bytes. The table below uses ctx_len=8192, gen_len=128. All vLLM baselines use exact-sized token pools matching the workload. KV/stage memory is the cache or cache-plus-staging footprint. vLLM BF16 means dtype=bfloat16 with kv_cache_dtype=auto; vLLM FP8 means kv_cache_dtype=fp8.

| Model / compact-DMS row | c | vLLM BF16 KV → FastDMS KV | BF16 KV saved | vLLM FP8 KV → FastDMS KV | FP8 KV saved | vLLM TQ4 KV → FastDMS KV | TQ4 KV saved |
|---|---|---|---|---|---|---|---|
| Llama-3.2-1B FastDMS default | 1 | 0.312 → 0.056 GiB | 5.6x | 0.156 → 0.056 GiB | 2.8x | 0.142 → 0.056 GiB | 2.5x |
| Llama-3.2-1B FastDMS default | 8 | 2.062 → 0.431 GiB | 4.8x | 1.031 → 0.431 GiB | 2.4x | 0.939 → 0.431 GiB | 2.2x |
| Qwen3-8B FastDMS compact DMS | 1 | 1.406 → 0.184 GiB | 7.6x | 0.703 → 0.184 GiB | 3.8x | - | - |
| Qwen3-8B FastDMS compact DMS | 8 | 9.281 → 1.462 GiB | 6.3x | 4.641 → 1.462 GiB | 3.2x | - | - |
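For context, the vLLM baselines above map to configurations roughly along these lines (the checkpoint name and context sizing here are illustrative; the exact token-pool sizing used for the benchmarks lives in the repo's scripts):

```python
from vllm import LLM

# "vLLM BF16": bfloat16 weights, KV cache follows the model dtype.
llm_bf16 = LLM(
    model="meta-llama/Llama-3.2-1B",   # illustrative checkpoint name
    dtype="bfloat16",
    kv_cache_dtype="auto",
    max_model_len=8192 + 128,          # ctx_len + gen_len from the workload above
)

# "vLLM FP8": same weights, FP8 KV cache.
llm_fp8 = LLM(
    model="meta-llama/Llama-3.2-1B",
    dtype="bfloat16",
    kv_cache_dtype="fp8",
    max_model_len=8192 + 128,
)
```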

For those who are curious: yes, this beats out TurboQuant in both speed and memory usage:

| Path | c | Prefill tok/s | Prefill vs BF16 | Decode tok/s | Decode vs BF16 | KV / stage memory | Status |
|---|---|---|---|---|---|---|---|
| vLLM BF16 | 1 | 123098.0 | 1.00x | 459.4 | 1.00x | 0.312 GiB BF16 KV | dense BF16-KV baseline |
| vLLM FP8 | 1 | 119991.3 | 0.97x | 489.4 | 1.07x | 0.156 GiB FP8 KV | dense FP8-KV baseline |
| vLLM TurboQuant 4bit_nc | 1 | 126429.0 | 1.03x | 333.4 | 0.73x | 0.142 GiB TQ4 KV | 4-bit KV baseline |
| FastDMS FP8 compact-DMS default | 1 | 123194.6 | 1.00x | 698.9 | 1.52x | 0.056 GiB | promoted zero-BF16 row |
| FastDMS B46 int4 speed profile | 1 | 121489.9 | 0.99x | 1060.0 | 2.31x | 0.056 GiB + 0.719 GiB int4 shadow | default-off storage-for-speed |
| vLLM BF16 | 8 | 103668.5 | 1.00x | 2357.5 | 1.00x | 2.062 GiB BF16 KV | dense BF16-KV baseline |
| vLLM FP8 | 8 | 102959.5 | 0.99x | 2888.7 | 1.23x | 1.031 GiB FP8 KV | dense FP8-KV baseline |
| vLLM TurboQuant 4bit_nc | 8 | 104409.9 | 1.01x | 1696.0 | 0.72x | 0.939 GiB TQ4 KV | 4-bit KV baseline |
| FastDMS FP8 compact-DMS default | 8 | 105531.7 | 1.02x | 3606.9 | 1.53x | 0.431 GiB | promoted zero-BF16 row |
| FastDMS B25 narrow int4 speed profile | 8 | 104753.7 | 1.01x | 3640.7 | 1.54x | 0.431 GiB + 0.078 GiB int4 shadow | default-off storage-for-speed |
| FastDMS BF16-attention speed control | 8 | 108070.5 | 1.04x | 3745.3 | 1.59x | 0.429 GiB + 0.312 GiB BF16 backing | explicit speed control |

Of course, none of this matters if the compression tanks output quality. In theory, DMS eviction is applied before FP8 quantization (it only decides which tokens to keep or evict), so FastDMS compact-DMS should land in the same quality ballpark as FP8 quantization alone, but it's still worth double-checking.

Quality is measured by generating tokens with a compressed KV cache and comparing against an uncompressed reference, token by token. Lower KLD (KL divergence) is better - it means the compressed model's next-token probabilities stay closer to the reference. Higher token match is better - it means greedy decoding produces the same output.

How to read the columns:

  • KLD vs ref - KL divergence in nats/token between the compressed and reference logits. Measures how much the probability distribution over next tokens shifts due to compression. Lower is better; 0.000 means identical.
  • Token match - percentage of greedy-decoded tokens that are identical to the reference. 96.9% means ~2 out of 64 tokens differed.
  • Tokens scored - how many decode steps could be compared. Once the candidate produces a different token than the reference, the sequences diverge and later steps aren't comparable. 33/60 means quality metrics only cover the first 33 tokens before divergence - the reported KLD and PPL are over that prefix, not the full generation. A higher ratio means the comparison is more complete.

Test setup: ctx_len=1024, decode_len=16, four prompts (60-64 total decode steps). vLLM rows compare against vLLM BF16 full-KV logits. FastDMS rows compare against FastDMS with eviction disabled (reference window of 1M tokens, effectively keeping the full KV cache).
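A minimal sketch of what this comparison looks like for a single prompt, given two aligned per-step logit streams from the reference and the compressed run. The function name is mine, and the repo's harness aggregates this over the four prompts rather than one stream, but the KLD, greedy-match, and stop-on-divergence logic follow the description above.

```python
import torch
import torch.nn.functional as F

def compare_to_reference(ref_logits: torch.Tensor, cand_logits: torch.Tensor):
    """ref_logits/cand_logits: [steps, vocab] logits from the full-KV reference
    and the compressed run on the same prompt. Returns (mean KLD in nats/token,
    greedy token match rate, steps scored). Scoring stops once greedy outputs
    diverge, since later steps are no longer comparable."""
    scored, matches, kld_sum = 0, 0, 0.0
    for ref, cand in zip(ref_logits, cand_logits):
        ref_logp = F.log_softmax(ref, dim=-1)
        cand_logp = F.log_softmax(cand, dim=-1)
        kld_sum += torch.sum(ref_logp.exp() * (ref_logp - cand_logp)).item()  # KL(ref || cand)
        scored += 1
        same = bool(ref.argmax() == cand.argmax())
        matches += int(same)
        if not same:
            break   # sequences diverge from here on
    return kld_sum / scored, matches / scored, scored
```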

shisa-ai/Llama-3.2-1B-DMS-8x

| Path | Reference | KLD vs ref | Token match | PPL | Tokens scored |
|---|---|---|---|---|---|
| vLLM BF16 full KV | self | 0.000000 | 100.0% | 2.3748 | 60/60 |
| vLLM FP8 KV | vLLM BF16 | 0.005110 | 92.2% | 2.0893 | 33/60 |
| vLLM TurboQuant 4bit_nc | vLLM BF16 | 0.012730 | 76.6% | 1.9606 | 22/60 |
| FastDMS FP8 compact-DMS | FastDMS no-evict | 0.003009 | 96.9% | 2.2810 | 64/64 |

nvidia/Qwen3-8B-DMS-8x

| Path | Reference | KLD vs ref | Token match | PPL | Tokens scored |
|---|---|---|---|---|---|
| vLLM BF16 full KV | self | 0.000000 | 100.0% | 1.6738 | 60/60 |
| vLLM FP8 KV | vLLM BF16 | 0.001042 | 70.3% | 1.1971 | 32/60 |
| vLLM TurboQuant 4bit_nc | vLLM BF16 | 0.006039 | 84.4% | 1.4910 | 45/60 |
| FastDMS FP8 compact-DMS | FastDMS no-evict | 0.005284 | 95.3% | 1.8301 | 64/64 |

FastDMS compact-DMS scores 64/64 tokens on both models - every decode step was comparable to the reference, and the KLD is lower than or comparable to vLLM's own FP8 and TurboQuant compression. Note that PPL values across rows are not directly comparable when Tokens scored differs, because each row's PPL is computed over a different-length prefix.

What's the catch?

So, if this is so darn great, why wasn't everyone using it already? Well, it turns out that if you want to implement this in a production engine like vLLM, you have to do major surgery on it. DMS compact KV touches nearly every serving-engine subsystem:

| Subsystem | What changes for DMS |
|---|---|
| PagedAttention / KV memory pool | DMS needs per-layer, per-head variable token counts with partial block deallocation - not standard fixed-page blocks |
| Prefill kernel | Must stream surviving K/V into compact per-layer storage after DMS extraction, rather than writing dense KV pages |
| Decode kernel | Each decode step evaluates per-head keep/evict, manages a sliding retention window, and appends to compact storage |
| Attention scoring | Replaced entirely: split-K grouped compact decode attention over variable-length per-head live spans |
| Scheduler / admission | Must admit requests based on compact KV capacity, not dense full-sequence page count (see the sketch below) - this is the hardest boundary |
| Prefix caching | DMS eviction is per-sequence and per-head; shared prefix blocks need per-sequence eviction overlays or must be disabled |
| Continuous batching | Memory accounting must reflect actual surviving token count, not logical sequence length |
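To make the scheduler/admission point concrete, here's a deliberately naive sketch of budgeting by expected surviving tokens instead of logical sequence length. The window size and keep ratio are assumptions, and a real engine would still need preemption or a dense fallback path, since actual eviction rates are data-dependent.

```python
def can_admit(prompt_len: int, max_new_tokens: int,
              free_compact_slots: int, expected_keep_ratio: float = 1 / 6.4) -> bool:
    """Toy admission check: budget by expected *surviving* tokens, not logical length.

    Assumes a sliding retention window that is always held densely (size is
    hypothetical) and an expected post-eviction keep ratio for everything older."""
    window = 256  # assumed retention-window size, for illustration only
    total = prompt_len + max_new_tokens
    dense_part = min(total, window)
    sparse_part = max(total - window, 0) * expected_keep_ratio
    return dense_part + sparse_part <= free_compact_slots

# e.g. an 8K-context request fits in ~2K compact slots at ~6.4x expected compression
print(can_admit(prompt_len=8192, max_new_tokens=128, free_compact_slots=2048))
```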

God bless anyone who wants to give this a swing. The KV-cache compression seems real, with a correct implementation there's no quality hit, and as shown by the FastDMS implementation, it looks like it can run faster than non-DMS inference.

(lots more perf benchmarks, comparisons, and raw logs in the repo for those interested)

submitted by /u/randomfoo2