Following TheTom's great work yesterday showing TurboQuant running in llama.cpp, I've added a few complementary speedups to llama.cpp. So far the CPU and CUDA backends build and are fully usable. I'm seeing full-speed token generation on my 16 GB 4060 Ti at a 256k+ context window using Qwen 3.5 4B, which is pretty insane.
Check out DEEPDIVE.md for all the technical details and README_TURBOQUANT.md to get up and running.
If you have any questions or suggestions, hit me up or open a GitHub issue.




