NexQuant: Hardening 3-bit KV-Cache for the Edge. A Rust-native successor to Tom Turney’s TurboQuant+

Reddit r/LocalLLaMA / 4/1/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • NexQuant is presented as a production-hardened, Rust-native successor to Tom Turney’s TurboQuant+, targeting stable 3-bit KV-cache operation for running high-context LLMs on consumer hardware.
  • The project claims major memory savings, enabling 14B models to fit in roughly 4GB of VRAM/unified memory, largely through a hardened 3-bit KV-cache design.
  • It replaces previously noisy quantization/trajectory components with an MSE-only approach and reports passing 27/27 logic tests for stability.
  • NexQuant integrates sparse-V into the real-time decode loop and emphasizes “zero-alloc prefill” implemented in safe Rust to improve speed and reduce crash/memory-leak risk versus C++ prototypes.
  • It supports runtime dispatch across Metal, CUDA, and Vulkan, with CPU backends (AVX2/NEON) for broader hardware compatibility, and encourages feedback on Vulkan SPIR-V kernels via its GitHub repo.

We’ve been tracking Tom Turney’s work on TurboQuant+, and while the research was groundbreaking, the implementation was still a bit "crawling": noise issues, manual tuning, and memory leaks.

We’ve spent the last 24 hours building NexQuant, a production-hardened, Rust-native engine that lets you run high-context models on consumer hardware that would normally choke.

What’s under the hood?

  • 3–5× Memory Reduction: 14B models now fit comfortably in 4GB of VRAM/unified memory.
  • MSE-Only Stability: We’ve replaced the noisy QJL paths with a stable MSE-only trajectory. All 27/27 logic tests pass.
  • Integrated Sparse-V: Sparsity isn’t just a benchmark number anymore; it’s integrated into the real-time decode loop.
  • Zero-Alloc Prefill: Written in 100% safe Rust for maximum speed without the segfault friction of C++ prototypes.
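To make the 3-bit + MSE idea concrete, here’s a minimal Rust sketch of per-block 3-bit quantization that picks its scale by minimizing mean squared error over a small grid. Everything here (`quantize_block_3bit`, the 32-candidate scale grid, the signed [-4, 3] code range) is our own illustration of the general technique, not NexQuant’s actual API:

```rust
/// Quantize a block of f32 values to signed 3-bit codes in [-4, 3],
/// choosing the per-block scale that minimizes reconstruction MSE
/// over a small grid of candidates around the naive max-abs scale.
/// (Illustrative sketch; names and ranges are assumptions.)
fn quantize_block_3bit(block: &[f32]) -> (f32, Vec<i8>) {
    let max_abs = block.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    if max_abs == 0.0 {
        return (0.0, vec![0; block.len()]);
    }
    let mut best = (f32::INFINITY, max_abs / 4.0, Vec::new());
    for step in 1..=32 {
        // Candidate scales from ~1/24 up to ~4/3 of the max-abs scale.
        let scale = max_abs / 4.0 * (step as f32 / 24.0);
        let codes: Vec<i8> = block
            .iter()
            .map(|&v| (v / scale).round().clamp(-4.0, 3.0) as i8)
            .collect();
        let mse: f32 = block
            .iter()
            .zip(&codes)
            .map(|(&v, &c)| (v - c as f32 * scale).powi(2))
            .sum::<f32>()
            / block.len() as f32;
        if mse < best.0 {
            best = (mse, scale, codes);
        }
    }
    (best.1, best.2)
}

/// Reconstruct approximate f32 values from a scale and 3-bit codes.
fn dequantize(scale: f32, codes: &[i8]) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}
```

In a real KV-cache the codes would be bit-packed (roughly 3 bits per entry plus a scale per block) rather than stored as `i8`, which is where the memory reduction comes from.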

Hardware Support: Native runtime dispatch across Metal, CUDA, and Vulkan. If you’re on an old laptop or a Raspberry Pi, the CPU backend (AVX2/NEON) will still keep you in the race.
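Runtime dispatch like this usually boils down to a simple priority chain evaluated once at startup: prefer a GPU backend if one is available, otherwise fall back to the widest SIMD the CPU reports. The `Caps` and `Backend` types below are hypothetical, a sketch of the general pattern rather than NexQuant’s actual selection logic:

```rust
/// Compute backends the engine can dispatch to at runtime.
#[derive(Debug, PartialEq)]
enum Backend {
    Metal,
    Cuda,
    Vulkan,
    CpuAvx2,
    CpuNeon,
    CpuScalar,
}

/// Capabilities probed once at startup (hypothetical struct).
struct Caps {
    metal: bool,
    cuda: bool,
    vulkan: bool,
    avx2: bool,
    neon: bool,
}

/// Pick the best available backend: GPUs first, then SIMD CPU paths,
/// then a scalar fallback so the engine runs anywhere.
fn select_backend(caps: &Caps) -> Backend {
    if caps.metal {
        Backend::Metal
    } else if caps.cuda {
        Backend::Cuda
    } else if caps.vulkan {
        Backend::Vulkan
    } else if caps.avx2 {
        Backend::CpuAvx2
    } else if caps.neon {
        Backend::CpuNeon
    } else {
        Backend::CpuScalar
    }
}
```

On x86_64 the `avx2` flag would typically come from `std::is_x86_feature_detected!("avx2")`; NEON is baseline on aarch64.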

Acknowledgements: This project is a synthesis of community intelligence. Massive credit to Tom Turney for the original PolarQuant/TurboQuant+ breakthroughs that proved 3-bit KV-caches were mathematically possible. We also want to acknowledge Claude (Anthropic) for acting as a high-speed pair programmer, helping us navigate the complexities of Walsh-Hadamard Transforms and Rust GGUF parsing.
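For anyone curious about the Walsh-Hadamard part: a Hadamard rotation is commonly applied before quantization to spread outlier magnitudes evenly across a block, which makes low-bit codes behave much better. Here’s a minimal in-place fast Walsh-Hadamard transform in Rust, a generic textbook version rather than NexQuant’s actual kernel:

```rust
/// In-place fast Walsh-Hadamard transform (unnormalized) over a
/// power-of-two-length slice, in O(n log n) butterfly passes.
/// Applying it twice yields n times the original vector.
fn fwht(data: &mut [f32]) {
    let n = data.len();
    assert!(n.is_power_of_two(), "FWHT requires a power-of-two length");
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                // Butterfly: combine each pair at distance h.
                let (a, b) = (data[j], data[j + h]);
                data[j] = a + b;
                data[j + h] = a - b;
            }
        }
        h *= 2;
    }
}
```

Since the transform is its own inverse up to a factor of n, dequantization can undo the rotation with a second `fwht` pass and a 1/n rescale.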

The Mission: The goal is to ensure that even as models scale, the ability to run them remains local and decentralized.

GitHub: https://github.com/Ainix-dev/NexQuant

Let’s get this to light-speed. Feedback on the Vulkan SPIR-V kernels is especially welcome.

submitted by /u/SpiritOk6612