I bought an RTX 5060 Ti 16GB around Christmas with one goal: get a strong model running locally on my card without paying API fees. I have been testing local AI with OpenClaw. I did not come into this with a quantization background; I only learned about llama.cpp, LM Studio, and Ollama two months ago. I just wanted something better than the usual Q3-class compromise (see my first post for benchmarks). Many times I thought about buying a 24GB card, but one look at the prices quickly turned me away. When the TurboQuant paper came out, and some results showed memory can be saved in the KV cache, I started wondering whether the same style of idea could help on weights, not just the KV cache. After many long nights after work (until 2am), that turned into a llama.cpp fork.
This work is inspired by the broader transform-based quantization line, especially RaBitQ-style Walsh–Hadamard rotation ideas and the recent TurboQuant result (Tom). What I wanted to test was whether that same geometry could help on weights, not just the KV cache.

Main Result on Qwen3.5-27B
On wiki.test.raw, TQ3_1S lands within ~0.0139 PPL of Q4_0. That is a gap of only ~0.19%.

Size

Q4_0: ~14.4GB → TQ3_1S: ~12.9GB, about 10% smaller.
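As a sanity check, the ~14.4GB (Q4_0) and ~12.9GB (TQ3_1S) file sizes reported in the post can be turned into rough bits-per-weight. The ~27e9 parameter count and decimal-GB sizes are assumptions for this back-of-envelope estimate:

```python
# Back-of-envelope bits-per-weight from the file sizes in the post.
# Assumptions: "27B" means ~27e9 parameters and sizes are decimal GB.
params = 27e9

q4_0_bpw   = 14.4e9 * 8 / params  # ~4.27 bits/weight
tq3_1s_bpw = 12.9e9 * 8 / params  # ~3.82 bits/weight

print(round(q4_0_bpw, 2), round(tq3_1s_bpw, 2))
print(f"size reduction: {1 - 12.9/14.4:.1%}")  # ~10.4%, i.e. "about 10% smaller"
```

The effective figure sitting above the nominal 3.5 bits is expected: per-block scales and any tensors kept at higher precision (e.g. embeddings) add overhead on top of the packed weights.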
The practical point for me is simple: with TQ3_1S, the full 27B model fits on my 16GB 5060 Ti; the Q4_0 variant does not.
I’m not claiming “better than Q4_0” in general. I’m claiming something narrower and, I think, useful: near-Q4_0 quality at about 10% smaller size.
Speed record during the perplexity test:

- generation tg10: 15.55 tok/s

Caveats
Links

I will open source the quantization steps once I have enough feedback and testing.

Update: since a few people said I only compare against Q4_0, here is an update: TQ3_4S will be published, with faster processing speed.
TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti
Reddit r/LocalLLaMA / 4/1/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- A local LLM user reports building a llama.cpp fork that introduces a TurboQuant-inspired 3.5-bit weight quantization format (TQ3_1S) with Walsh–Hadamard rotation, 8-centroid quantization, dual half-block scales, and CUDA runtime support.
- On Qwen3.5-27B (wiki.test.raw), TQ3_1S achieves near-Q4_0 quality with only ~0.0139 PPL (~0.19%) gap versus Q4_0, indicating the weight-quantization geometry can preserve performance.
- The new format reduces model size from ~14.4GB (Q4_0) to ~12.9GB (TQ3_1S), about 10% smaller while staying close in perplexity.
- The practical result is that the 27B model fully fits on a 16GB RTX 5060 Ti with TQ3_1S, whereas the Q4_0 variant does not fit under the same setup.
- The author positions this as a narrower, practical improvement (near-Q4_0 quality at smaller size) rather than a claim of universally better-than-Q4_0 quantization.
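The format described in the first bullet (8 centroids, dual half-block scales) can be illustrated with a minimal sketch. The block size of 32, the centroid table, and the 8-bit scale encoding are all assumptions, since the post does not publish the format; they only show how 3-bit centroid indices plus two small per-half-block scales land at 3.5 bits/weight.

```python
import numpy as np

# Hypothetical centroid table -- the real TQ3_1S codebook is not published.
CENTROIDS = np.array([-1.0, -0.6, -0.3, -0.1, 0.1, 0.3, 0.6, 1.0])

def quantize_block(w):
    """Quantize 32 weights: two half-blocks of 16, each with its own scale,
    every weight mapped to the nearest of 8 centroids (a 3-bit index)."""
    assert w.shape == (32,)
    idx = np.empty(32, dtype=np.uint8)
    scales = np.empty(2)
    for h in range(2):
        half = w[16 * h : 16 * (h + 1)]
        s = np.max(np.abs(half))
        scales[h] = s if s > 0 else 1.0
        # nearest-centroid assignment for each normalized weight
        idx[16 * h : 16 * (h + 1)] = np.argmin(
            np.abs(half[:, None] / scales[h] - CENTROIDS[None, :]), axis=1)
    return idx, scales

def dequantize_block(idx, scales):
    out = CENTROIDS[idx].copy()
    out[:16] *= scales[0]
    out[16:] *= scales[1]
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=32)
idx, scales = quantize_block(w)
w_hat = dequantize_block(idx, scales)
print(np.abs(w - w_hat).max())  # reconstruction error, bounded by the scales
```

Under these assumptions the storage cost works out exactly: 32 × 3 bits of indices plus two 8-bit scales is 112 bits per 32-weight block, i.e. 3.5 bits per weight.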




