Google's TurboQuant saves memory, but won't save us from DRAM-pricing hell
Chocolate Factory’s compression tech clears the way to cheaper AI inference, not more affordable memory
When Google unveiled TurboQuant, an AI data compression technology that promises to slash the amount of memory required to serve models, many hoped it would help with a memory shortage that has seen prices triple since last year. Not so much.
TurboQuant isn't the savior you might be hoping for. Having said that, the underlying technology is still worth a closer look as it has major implications for model devs and inference providers.
What the heck is TurboQuant
Detailed by Google researchers in a recent blog post, TurboQuant is essentially a method of compressing data used in generative AI from higher to lower precisions, an approach commonly referred to as quantization.
According to the researchers, TurboQuant has the potential to cut memory consumption during inference by at least 6x, a bold claim at a time when DRAM and NAND prices are at record highs.
However, unlike most quantization methods, TurboQuant doesn't shrink the model. Instead, it aims to reduce the amount of memory required to store the key-value (KV) caches used to maintain context during LLM inference.
In a nutshell, the KV cache is a bit like the model's short-term memory. During a chat session, for example, the KV cache is how the model keeps track of your conversation.
Where things get tricky is that these KV caches can pile up quite quickly, often consuming more memory than the model itself.
Usually, these KV caches are stored at 16-bit precision; cut that to eight or even four bits per value and you reduce the memory required by a factor of 2x to 4x.
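To put numbers on that, here's a back-of-envelope sizing sketch. The model dimensions below are our own illustrative assumptions for a 70B-class model, not figures from Google's post:

```python
# Back-of-envelope KV-cache sizing. Dimensions are illustrative
# assumptions, not figures from Google's post.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    # 2x covers the separate key and value tensors, per token, per layer
    return 2 * layers * kv_heads * head_dim * seq_len * bits / 8

# Hypothetical 70B-class model: 80 layers, 8 KV heads, 128-dim heads
for bits in (16, 8, 4):
    gb = kv_cache_bytes(80, 8, 128, 128_000, bits) / 1e9
    print(f"{bits:>2}-bit cache at 128k tokens: {gb:.1f} GB")
```

Under those assumptions, a single 128,000-token context eats roughly 42 GB at 16 bits, which is why halving or quartering the bit width matters so much.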
While TurboQuant has certainly brought attention to KV cache quantization, the overarching idea isn't new. In fact, it's quite common for inference engines to store KV caches at FP8 for exactly this reason.
However, this kind of quantization isn't free. Fewer bits means less memory, but also less fidelity, and quantization methods tend to introduce performance overheads of their own.
This is really where TurboQuant's innovations lie. Google claims that it can achieve quality similar to BF16 using just 3.5 bits, while also mitigating those pesky overheads. At 4 bits, they claim as much as an 8x speedup on H100s when computing attention logits used to decide what in the context is or isn't important to the request.
And the researchers didn't stop there. In testing, they found they could crush the KV caches to 2.5 bits with minimal quality loss, which is where the claimed 6x memory reduction appears to have come from.
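That headline figure falls out of simple arithmetic against the usual 16-bit baseline:

```python
# The claimed 6x reduction as arithmetic against a 16-bit baseline
baseline_bits, turboquant_bits = 16, 2.5
ratio = baseline_bits / turboquant_bits
print(f"compression ratio: {ratio:.1f}x")  # compression ratio: 6.4x
```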
How does it work
TurboQuant is able to achieve this feat by combining two mathematical approaches: Quantized Johnson-Lindenstrauss (QJL) and PolarQuant.
PolarQuant works by mapping KV-cache vectors, which are just high-dimensional quantities with a magnitude and a direction, onto a circular grid that uses polar rather than Cartesian coordinates.
"This is comparable to replacing 'Go 3 blocks east, 4 blocks north' with 'go 5 blocks total at a 37-degree angle,'" Google's blog post explains.
Using this approach, the vector's magnitude and direction are now represented by its radius and angle, which the search giant explains eliminates the memory overhead associated with data normalization as each vector now shares a common reference point.
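Stripped of the high-dimensional detail, the coordinate swap looks something like this toy 2D sketch. The grid step sizes are arbitrary assumptions for illustration, and the blog's 37 degrees appears to measure the same direction from north rather than from east:

```python
import math

# Toy 2D illustration of the polar-mapping idea. Real KV vectors are
# high-dimensional; the grid steps here are arbitrary assumptions.
def to_polar(x, y):
    return math.hypot(x, y), math.degrees(math.atan2(y, x))

def quantize_polar(r, theta, r_step=0.5, theta_step=5.0):
    # Snap radius and angle onto a coarse polar grid
    return round(r / r_step) * r_step, round(theta / theta_step) * theta_step

r, theta = to_polar(3, 4)          # 5.0 blocks at ~53 degrees from east
print(quantize_polar(r, theta))    # (5.0, 55.0)
```

Storing a grid index for the radius and another for the angle takes far fewer bits than storing every Cartesian coordinate at full precision.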
In addition to PolarQuant, Google also employs QJL to correct any errors introduced during the first phase and preserve the accuracy of the attention score used by the model to determine what information is or isn't important to serving a request.
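For the curious, the Johnson-Lindenstrauss trick underpinning QJL can be sketched in a few lines: randomly project a vector and keep only the signs, and the angle between two vectors, and hence an approximation of their attention score, can still be recovered. This is an illustration of the underlying math, not Google's implementation, and the dimensions are our own assumptions:

```python
import numpy as np

# Minimal sketch of the Johnson-Lindenstrauss idea behind QJL: a random
# projection followed by 1-bit (sign) quantization still preserves the
# angle between vectors well enough to estimate similarity scores.
rng = np.random.default_rng(0)
d, m = 128, 4096              # original dim, projection dim (assumptions)
S = rng.standard_normal((m, d))

def sketch(v):
    return np.sign(S @ v)     # one bit per projected coordinate

q = rng.standard_normal(d)
k = q + 0.3 * rng.standard_normal(d)   # a key correlated with the query

# The fraction of matching signs estimates the angle between q and k
match = np.mean(sketch(q) == sketch(k))
est_angle = np.pi * (1 - match)
true_angle = np.arccos(q @ k / (np.linalg.norm(q) * np.linalg.norm(k)))
print(f"true angle {true_angle:.3f} rad, estimated {est_angle:.3f} rad")
```

The estimated and true angles land within a few hundredths of a radian of each other despite the sketch storing just one bit per projected coordinate, which is the property that lets QJL repair PolarQuant's rounding errors cheaply.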
The result is that these vectors can be stored using a fraction of the memory. And this tech isn't limited to KV caches, either. According to Google, the technology also has implications for the vector databases used by search engines.
Why TurboQuant won't deliver us from memory mayhem
With a claimed compression ratio of 6:1, it's not surprising that many on Wall Street tied memory makers' share-price slides to the introduction of TurboQuant.
But while the tech is likely to make AI inference clusters more efficient and therefore less expensive to operate, it's unlikely to curb demand for the NAND flash and DRAM memory used to store those KV caches.
A year ago, open weights models like DeepSeek R1 offered context windows ranging from 64,000 to 256,000 tokens. Today, it's not uncommon to find open models sporting context windows exceeding one million tokens.
TurboQuant could allow an inference provider to make do with less memory, or let them serve up models with larger context windows. With code assistants and agentic frameworks like OpenClaw driving demand for larger context windows, the latter strikes us as the more likely of the two.
It seems that the industry watchers at TrendForce would agree. In a report published earlier this week, they predicted that TurboQuant will spur long-context applications that increase demand for memory rather than curb it. ®