Experiment: Entropy + OLS + SVD for KV cache compression

Reddit r/LocalLLaMA / 4/19/2026

Key Points

The author investigates KV cache optimization methods beyond standard Top-K pruning, noting that pruning fails selectively: a small number of tokens account for large error spikes.
They propose a three-stage approach: using entropy for token selection, OLS for reconstruction, and SVD for compression.
Early prototype results show about 3× lower error at low memory usage compared with their baseline while also reducing the occurrence of severe error spikes.
In some cases, the method achieves both lower error and lower memory usage, but the work is still experimental and invites feedback on potential failure modes.

I’ve been exploring KV cache optimization beyond Top-K pruning.

Observation: pruning fails *selectively* - a few tokens cause large error spikes.
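To make that observation concrete, here is a minimal sketch of how per-query error under Top-K pruning could be measured. It is not the code from this experiment: the random Q/K/V data, the single-head setup, the pruning score (total attention mass per cached token), and the helper names `topk_prune` and `per_query_error` are all assumptions for illustration. Comparing the mean and the max of the per-query error is one way to see whether a few queries dominate.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Plain single-head scaled dot-product attention."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def topk_prune(Q, K, V, k):
    """Keep the k cached tokens with the highest total attention mass.

    Assumption: the score is the attention each token receives, summed
    over a set of probe queries. Other Top-K variants score differently.
    """
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))           # (num_queries, num_tokens)
    keep = np.sort(np.argsort(A.sum(axis=0))[-k:])
    return K[keep], V[keep], keep

def per_query_error(Q, K, V, K_small, V_small):
    """Relative L2 error of the attention output, one value per query."""
    full = attention(Q, K, V)
    approx = attention(Q, K_small, V_small)
    return np.linalg.norm(full - approx, axis=1) / np.linalg.norm(full, axis=1)

# Synthetic data, purely illustrative (not real model activations).
rng = np.random.default_rng(0)
n, d, m = 256, 64, 32
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
Q = rng.standard_normal((m, d))

Kp, Vp, keep = topk_prune(Q, K, V, k=64)
err = per_query_error(Q, K, V, Kp, Vp)
print(f"mean error {err.mean():.3f}, max error {err.max():.3f}")
```

On real activations, a max far above the mean would match the "selective failure" pattern; random Gaussian data will not necessarily show it.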

So I tried:

- entropy (selection)
- OLS (reconstruction)
- SVD (compression)
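The post does not spell out how the three stages fit together, so the following is only one possible reading, sketched with NumPy on synthetic data. The assumptions: tokens are selected by their contribution to attention entropy over a set of calibration queries, OLS refits the kept value vectors so the pruned attention output matches the full-cache output, and a truncated SVD stores the refit values in low rank. The function names (`entropy_select`, `ols_refit_values`, `svd_compress`) and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entropy_select(Q, K, budget):
    """Stage 1 (assumed): keep tokens with the largest entropy contribution.

    Each token j is scored by -sum_i A[i, j] * log A[i, j], where A is the
    attention matrix over the calibration queries Q.
    """
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))
    contrib = -(A * np.log(A + 1e-12)).sum(axis=0)
    return np.sort(np.argsort(contrib)[-budget:])

def ols_refit_values(Q, K, V, keep):
    """Stage 2 (assumed): least-squares refit of the kept value vectors.

    Solves min_W ||A_keep @ W - full_output||_F, so attention over the kept
    keys reproduces the full-cache output as closely as possible on Q.
    """
    d = Q.shape[-1]
    target = softmax(Q @ K.T / np.sqrt(d)) @ V       # full-cache output
    A_keep = softmax(Q @ K[keep].T / np.sqrt(d))     # renormalized over kept keys
    V_fit, *_ = np.linalg.lstsq(A_keep, target, rcond=None)
    return V_fit

def svd_compress(M, rank):
    """Stage 3: truncated SVD, returning factors with M ~= left @ right."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

# Synthetic data, purely illustrative (not real model activations).
rng = np.random.default_rng(0)
n, d, m, budget, rank = 256, 64, 512, 64, 16
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
Q = rng.standard_normal((m, d))                      # calibration queries

keep = entropy_select(Q, K, budget)
V_fit = ols_refit_values(Q, K, V, keep)
left, right = svd_compress(V_fit, rank)

# Compare output error on the calibration queries at each stage.
target = softmax(Q @ K.T / np.sqrt(d)) @ V
A_keep = softmax(Q @ K[keep].T / np.sqrt(d))
err_pruned = np.linalg.norm(A_keep @ V[keep] - target)
err_ols = np.linalg.norm(A_keep @ V_fit - target)
err_svd = np.linalg.norm(A_keep @ (left @ right) - target)
print(f"pruned {err_pruned:.3f}  ols {err_ols:.3f}  ols+svd {err_svd:.3f}")
print(f"value floats stored: {V_fit.size} -> {left.size + right.size}")
```

By construction, the OLS refit cannot do worse than plain pruning on the calibration queries, since the unmodified kept values are one candidate solution. Whether that holds on unseen queries, and how much the SVD step gives back, depends on the data, so held-out queries are the place to look for breakage.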

Early results:

- ~3× lower error at low memory
- avoids the large error spikes
- in some cases, lower memory as well as lower error

Still a prototype - would love feedback, especially where this might break.
