We’ve been tracking Tom Turney’s work on TurboQuant+. The research was revolutionary, but the implementation was still at the "crawling" stage: noise issues, manual tuning, memory leaks. We’ve spent the last 24 hours building NexQuant, a production-hardened, Rust-native engine that lets you run high-context models on consumer hardware that would normally choke.
What’s under the hood?
Hardware Support: Native runtime dispatch for Metal, CUDA, and Vulkan. If you have an old laptop or a Raspberry Pi, the CPU backend (AVX2/NEON) will still keep you in the race.
Acknowledgements: This project is a synthesis of community intelligence. Massive credit to Tom Turney for the original PolarQuant/TurboQuant+ breakthroughs that proved 3-bit KV-caches were mathematically possible. We also want to acknowledge Claude (Anthropic) for acting as a high-speed pair programmer, helping us navigate the complexities of Walsh-Hadamard Transforms and Rust GGUF parsing.
The Mission: The goal is to ensure that even as models scale, the ability to run them remains local and decentralized.
GitHub: https://github.com/Ainix-dev/NexQuant
Let’s get this to light-speed. Feedback on the Vulkan SPIR-V kernels is especially welcome.
NexQuant: Hardening 3-bit KV-Cache for the Edge. A Rust-native successor to Tom Turney’s TurboQuant+
Reddit r/LocalLLaMA / 4/1/2026
💬 Opinion, Developer Stack & Infrastructure, Signals & Early Trends, Tools & Practical Usage, Models & Research
Key Points
- NexQuant is presented as a production-hardened, Rust-native successor to Tom Turney’s TurboQuant+, targeting stable 3-bit KV-cache operation for running high-context LLMs on consumer hardware.
- The project claims major memory savings, enabling 14B models to fit in roughly 4GB of VRAM/unified memory, largely through a hardened 3-bit KV-cache design.
- It replaces previously noisy quantization/trajectory components with an MSE-only approach and reports passing 27/27 logic tests for stability.
- NexQuant integrates sparse-V into the real-time decode loop and emphasizes “zero-alloc prefill” implemented in safe Rust to improve speed and reduce crash/memory-leak risk versus C++ prototypes.
- It supports runtime dispatch across Metal, CUDA, and Vulkan, with CPU backends (AVX2/NEON) for broader hardware compatibility, and encourages feedback on Vulkan SPIR-V kernels via its GitHub repo.
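
To make the 3-bit KV-cache idea above more concrete, here is a minimal Rust sketch. It is not taken from the NexQuant repository. Every function name (`kv_cache_bytes`, `fwht`, `quantize_3bit`, `dequantize_3bit`, `mse`, `roundtrip`) is hypothetical, and the quantizer is a plain uniform 8-level scheme with one absmax-derived step per vector, swapped in for illustration. The sketch assumes a vector is rotated with an orthonormal Walsh-Hadamard transform before quantization and judged by mean squared error alone. It also leaves each 3-bit code in its own byte instead of bit-packing, and the size estimate ignores per-block scales and other metadata.

```rust
/// Bytes needed for the K and V tensors of a cache at `bits` per element.
/// Ignores per-block scales and any other metadata.
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, context: u64, bits: u64) -> u64 {
    // Two tensors (K and V) per layer, one head_dim vector per KV head per token.
    let elements = 2 * layers * kv_heads * head_dim * context;
    (elements * bits + 7) / 8 // round up to whole bytes
}

/// In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so that it is
/// orthonormal and therefore its own inverse. Length must be a power of two.
fn fwht(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        for block in (0..n).step_by(2 * h) {
            for i in block..block + h {
                let (a, b) = (x[i], x[i + h]);
                x[i] = a + b;
                x[i + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}

/// Uniform 8-level (3-bit) quantizer with one step size per vector.
/// Returns the codes (0..=7, one per byte here, not bit-packed) and the step.
fn quantize_3bit(x: &[f32]) -> (Vec<u8>, f32) {
    let max_abs = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let step = max_abs / 4.0; // 8 bins of width `step` cover [-max_abs, max_abs]
    let codes: Vec<u8> = x
        .iter()
        .map(|&v| {
            if step == 0.0 {
                4 // all-zero input: any code dequantizes to 0 because step is 0
            } else {
                (v / step + 4.0).floor().clamp(0.0, 7.0) as u8
            }
        })
        .collect();
    (codes, step)
}

/// Reconstructs each value at the midpoint of its bin.
fn dequantize_3bit(codes: &[u8], step: f32) -> Vec<f32> {
    codes.iter().map(|&c| (c as f32 - 3.5) * step).collect()
}

/// Mean squared error between two equal-length vectors.
fn mse(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>() / a.len() as f32
}

/// Rotate, quantize to 3 bits, dequantize, rotate back.
fn roundtrip(x: &[f32]) -> Vec<f32> {
    let mut rotated = x.to_vec();
    fwht(&mut rotated);
    let (codes, step) = quantize_3bit(&rotated);
    let mut restored = dequantize_3bit(&codes, step);
    fwht(&mut restored); // the orthonormal WHT undoes itself
    restored
}
```

As a worked example with made-up dimensions (40 layers, 8 KV heads, head dimension 128, 32,768 tokens of context), `kv_cache_bytes` gives 5,368,709,120 bytes (5 GiB) at 16 bits and 1,006,632,960 bytes (about 0.94 GiB) at 3 bits. These figures cover only the cache, not the model weights, and are not a reproduction of the 4GB claim. Because the rotation is orthonormal, the error measured by `mse` on the restored vector equals the quantization error in the rotated domain, which this quantizer bounds at `step * step / 4`.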


