Why exactly can't we use the techniques in TurboQuant on the model's quantizations themselves?

Reddit r/LocalLLaMA / 3/30/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The post asks for an ELI5-style explanation of why “TurboQuant” style techniques can’t be applied directly to quantized model weights themselves, even though similar methods have long been used on model caches.
  • It contrasts applying quantization/optimization techniques to the model's weights versus applying them to the KV cache, where the constraints and bottlenecks differ (weights are static and quantized offline; cache entries are produced and consumed on the fly during inference).
  • The question implies that the limiting factor is likely tied to how TurboQuant leverages properties of full-precision (or differently structured) tensors, which may not hold after weights are already quantized (e.g., scale/zero-point behavior or error characteristics).
  • It frames the discussion around commonly used quantization formats (e.g., Q4_0/Q4_1), suggesting that users expect similar compatibility but are encountering practical barriers.
  • Overall, it solicits clarification on the underlying technical constraints that prevent reusing the same approach across both model quantizations and cache quantizations.
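To make the scale/zero-point behavior mentioned above concrete, here is a simplified sketch of Q4_0-style blockwise quantization (one floating-point scale per 32-value block, 4-bit signed integers). This is an illustration of the general scheme, not the exact llama.cpp bit layout; the function names and block size constant are assumptions for the example. It also shows the practical point at issue: once a tensor has been quantized, its values sit on a discrete per-block grid, so the distributional assumptions that rotation-based methods rely on for full-precision tensors no longer hold.

```python
import numpy as np

BLOCK = 32  # block size used by Q4_0-style formats (llama.cpp uses 32)

def q4_0_quantize(x):
    """Blockwise 4-bit quantization: one fp32 scale per block of 32 values,
    integers clipped to [-8, 7]. Simplified; not the real storage layout."""
    x = x.reshape(-1, BLOCK)
    amax = np.abs(x).max(axis=1, keepdims=True)
    scale = amax / 7.0          # map the largest magnitude in each block to +/-7
    scale[scale == 0] = 1.0     # avoid division by zero for all-zero blocks
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def q4_0_dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

np.random.seed(0)
x = np.random.randn(64).astype(np.float32)
q, s = q4_0_quantize(x)
x_hat = q4_0_dequantize(q, s)

# Per-element rounding error is bounded by half the block's scale.
max_err = np.abs(x - x_hat).max()

# Re-quantizing the already-quantized tensor changes nothing: the values
# already lie on the grid, so the "error signal" the method would exploit
# is gone.
q2, s2 = q4_0_quantize(x_hat)
x_hat2 = q4_0_dequantize(q2, s2)
```

Q4_1 differs only in adding a per-block zero-point (minimum), so `x ≈ scale * q + min` with unsigned 4-bit integers; the same grid argument applies.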

Can someone ELI5? We've been using the same methods on both model and cache for a while (Q4_0/1, etc).

submitted by /u/ea_nasir_official_