Experiment: Entropy + OLS + SVD for KV cache compression

Reddit r/LocalLLaMA / 4/19/2026


Key Points

  • The author investigates KV cache optimization methods beyond standard Top-K pruning, noting that pruning can fail in a selective way where a few tokens trigger large error spikes.
  • They propose a three-stage approach: using entropy for token selection, OLS for reconstruction, and SVD for compression.
  • Early prototype results show about 3× lower error at low memory usage compared with their baseline while also reducing the occurrence of severe error spikes.
  • In some cases, the method achieves both lower error and lower memory usage, but the work is still experimental and invites feedback on potential failure modes.

I’ve been exploring KV cache optimization beyond Top-K pruning.

Observation: pruning fails *selectively* - a few tokens cause large error spikes.
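
One way to see this kind of selective failure is to look at per-query error instead of the average. The sketch below is not from the post: it uses random matrices as a stand-in for a real KV cache, a hypothetical `topk_prune` helper that keeps the tokens with the highest total attention mass, and it assumes "error" means the relative difference between full and pruned attention output.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def topk_prune(Q, K, V, keep):
    """Hypothetical Top-K baseline: keep the `keep` tokens with the
    highest attention mass summed over all queries."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))
    idx = np.sort(np.argsort(A.sum(axis=0))[-keep:])
    return K[idx], V[idx], idx

# Synthetic data (assumption): 64 queries, 256 cached tokens, head dim 32.
rng = np.random.default_rng(0)
n, d = 256, 32
Q = rng.normal(size=(64, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

full = attention(Q, K, V)
Kp, Vp, idx = topk_prune(Q, K, V, keep=64)

# Relative error per query: the max can sit well above the mean.
err = np.linalg.norm(full - attention(Q, Kp, Vp), axis=1) / np.linalg.norm(full, axis=1)
print("mean error:", err.mean(), "max error:", err.max())
```

Comparing `err.max()` against `err.mean()` on real attention heads, not random data, is what would show whether the spikes are concentrated in a few tokens.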

So I tried:

- entropy (selection)
- OLS (reconstruction)
- SVD (compression)
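
The three stages could be wired together roughly as below. This is a sketch under stated assumptions, not the method from the blog: the post does not say how entropy is computed or what OLS is fit against. Here entropy is taken over the attention each token receives across queries (low entropy = a few queries depend on it heavily), OLS refits the values of the kept tokens so their attention output matches the full-cache output, and SVD truncates the refit value matrix to a lower rank. All function names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entropy_select(A, keep):
    """Assumed selection rule: keep the tokens whose received attention
    is most concentrated on a few queries (lowest column entropy)."""
    P = A / A.sum(axis=0, keepdims=True)      # per-token distribution over queries
    H = -(P * np.log(P + 1e-12)).sum(axis=0)  # entropy per token
    return np.sort(np.argsort(H)[:keep])

def ols_refit_values(A_kept, target):
    """Least-squares values V_fit minimising ||A_kept @ V_fit - target||."""
    V_fit, *_ = np.linalg.lstsq(A_kept, target, rcond=None)
    return V_fit

def svd_compress(M, rank):
    """Rank-`rank` factorisation M ~= left @ right."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

# Synthetic stand-in for one attention head (assumption).
rng = np.random.default_rng(0)
n, d, m, keep, rank = 256, 32, 128, 32, 16
Q = rng.normal(size=(m, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

A = softmax(Q @ K.T / np.sqrt(d))
full = A @ V                                   # full-cache attention output

idx = entropy_select(A, keep)                  # 1. selection
A_kept = softmax(Q @ K[idx].T / np.sqrt(d))
V_fit = ols_refit_values(A_kept, full)         # 2. reconstruction
left, right = svd_compress(V_fit, rank)        # 3. compression

def rel_err(approx):
    return np.linalg.norm(full - approx) / np.linalg.norm(full)

print("pruned only     :", rel_err(A_kept @ V[idx]))
print("pruned + OLS    :", rel_err(A_kept @ V_fit))
print("pruned+OLS+SVD  :", rel_err(A_kept @ (left @ right)))
```

Note that the OLS fit here is evaluated on the same queries it was fit on, so it can only look at least as good as plain pruning. Any real comparison would need held-out queries, which is probably one of the places something like this could break.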

Early results:

- ~3× lower error than my baseline at low memory
- avoids error spikes
- in some cases, lower error and lower memory at the same time

Blog: https://jchandra.com/posts/hae-ols/

Still a prototype - would love feedback, especially where this might break.

submitted by /u/Many_Perception_1703