I wanted to self-test the TurboQuant research from Google, specifically via llama.cpp. The first image is from Aryan Kapoor on the llama.cpp PR and the second is from my own experiments using Metal on Apple Silicon. It's totally clear that this method works for keeping the KV cache in check. I think I took a wrong turn somewhere, though, because my TPS on Metal is about 50% lower than f16 - not sure why. I also tried to get some kernels working on a CUDA machine, but I was getting absolutely garbage outputs, so even though the KV savings matched what others reported, I definitely did something wrong. I'll leave that to the experts.

That being said, this all seems like a huge boon for people running local models. For reference, I build AnythingLLM, and the vast majority of people are on, at best, 8-12GB of VRAM or just 16-32GB of RAM, and this would let them run "smarter" models with a reasonable context. People who are GPU-rich can just stretch their legs a little further, working up to 250K-1M. Honestly, I am excited about this because, even as consumer hardware gets better, being limited to 16K of context just to leave room for other apps on the device really knee-caps local models once you add even a modest conversation, tool-call injection, and injected context.

To me, this still doesn't mean the death of RAG or anything like that. I just think we are going to see a step function in the scope of what you can reasonably do on-device. Right now, any moderately complex task or chained tool call exhausts most of a window - this can open up a lot more tasks to be done locally.

There are also PRs for MLX and vLLM if anyone wants to run some personal tests. It's certainly early in development across the entire ecosystem, so expect some friction. Some people think this will reduce cloud model token costs; honestly, I just expect providers to adopt it (or they already are, with NVIDIA NVFP4 or something similar) and keep the difference as margin - who knows.
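For a rough sense of why the KV cache is the bottleneck at the memory budgets the post describes, here is a minimal sizing sketch. The model dimensions are assumed (a Llama-3-8B-style config with 32 layers, 8 KV heads, and head dimension 128), and the bytes-per-element figures stand in for generic 8-bit and 4-bit caches; they are illustrative only and do not reflect TurboQuant's actual format or overheads.

```python
# Back-of-the-envelope KV cache sizing. Config is an assumed Llama-3-8B-style
# model (32 layers, 8 KV heads via GQA, head_dim 128); the 1.0 / 0.5 byte
# figures are generic 8-bit / 4-bit placeholders, NOT TurboQuant's format.

def kv_cache_bytes(ctx_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2.0):
    """Total K+V cache size in bytes for a dense-attention model."""
    elems_per_token = 2 * n_kv_heads * head_dim * n_layers  # K and V, every layer
    return ctx_len * elems_per_token * bytes_per_elem

GIB = 1024 ** 3
for ctx in (16_384, 131_072, 262_144, 1_048_576):
    f16 = kv_cache_bytes(ctx, bytes_per_elem=2.0)   # f16 baseline
    q8  = kv_cache_bytes(ctx, bytes_per_elem=1.0)   # ~8-bit cache
    q4  = kv_cache_bytes(ctx, bytes_per_elem=0.5)   # ~4-bit cache (illustrative)
    print(f"ctx={ctx:>9,}  f16={f16/GIB:6.1f} GiB  ~8-bit={q8/GIB:6.1f} GiB  ~4-bit={q4/GIB:6.1f} GiB")
```

Under these assumptions, a 16K context already costs about 2 GiB at f16 on top of the model weights, which is the squeeze the post describes on 8-12GB cards, while halving or quartering the bytes per element is what pushes 128K-1M contexts into plausible territory.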
TurboQuant in Llama.cpp benchmarks
Reddit r/LocalLLaMA / 3/27/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- A Reddit user benchmarks Google’s TurboQuant when running locally via llama.cpp, reporting that it appears to keep KV cache usage under control and work effectively for long-context scenarios.
- Performance results are mixed on the user’s Apple Silicon/Metal setup, with TPS reportedly about 50% lower than FP16, while CUDA attempts produced unusable outputs, suggesting implementation or runtime-tuning challenges (a rough reproduction harness is sketched after these points).
- The post argues TurboQuant could be a major enabler for consumer hardware, allowing users with ~8–12GB VRAM or ~16–32GB RAM to run “smarter” models with more reasonable context lengths.
- The author expects a step-change in what tasks can be handled on-device (e.g., chained tool calls and injected context) without exhausting context windows, potentially shifting the scope of local LLM workflows.
- Early ports are also mentioned for MLX and vLLM, but the user cautions that friction is likely since support is still early across the ecosystem.
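For anyone who wants to try the TPS comparison before the TurboQuant PRs land, here is a rough harness against the KV-cache quantization llama.cpp already ships (q8_0 cache types). The `type_k`/`type_v` parameters, `flash_attn` flag, and `GGML_TYPE_*` constants are assumptions about current llama-cpp-python bindings (verify against your installed version), and the model path and prompt are placeholders.

```python
# Rough wall-clock tokens/sec comparison between an f16 KV cache and the
# q8_0 quantized cache llama.cpp already supports. The type_k/type_v kwargs,
# flash_attn flag, and GGML_TYPE_* constants are assumptions about current
# llama-cpp-python bindings; the model path and prompt are placeholders.
import time
from llama_cpp import Llama, GGML_TYPE_F16, GGML_TYPE_Q8_0

MODEL = "models/model.gguf"  # placeholder: point at any local GGUF
PROMPT = "Explain how a hash map handles collisions, in three paragraphs."

def measure_tps(cache_type):
    # flash attention is required by llama.cpp for quantized V caches
    llm = Llama(model_path=MODEL, n_ctx=16_384, n_gpu_layers=-1,
                flash_attn=True, type_k=cache_type, type_v=cache_type,
                verbose=False)
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=256)
    elapsed = time.perf_counter() - start
    # crude measure: includes prompt processing as well as decode
    return out["usage"]["completion_tokens"] / elapsed

print(f"f16 KV cache : {measure_tps(GGML_TYPE_F16):.1f} tok/s")
print(f"q8_0 KV cache: {measure_tps(GGML_TYPE_Q8_0):.1f} tok/s")
```

This only exercises the cache types already merged into llama.cpp; reproducing the post's f16-vs-TurboQuant comparison would require building the PR branch itself.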