Llama.cpp with Turboquant, Heavy-Hitter Oracle (H2O), and StreamingLLM. Even more performance!

Reddit r/LocalLLaMA / 3/28/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A new Llama.cpp integration combines Turboquant with the Heavy-Hitter Oracle (H2O) and StreamingLLM to deliver additional, complementary performance speedups.
  • The author reports that both CPU and CUDA builds are functional, enabling full-speed token generation on a 16GB RTX 4060 Ti while running Qwen 3.5 4B at extremely large context windows (256k+).
  • The project provides setup guidance via DEEPDIVE.md and README_TURBOQUANT.md, with installation/run details distributed across the repository documentation.
  • Users are encouraged to consult the linked GitHub repo for technical specifics and to submit questions or issues for further improvements.
  • The post positions the approach as a practical path to higher throughput and longer-context local inference using Llama.cpp-style deployments.
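To make the combination concrete: StreamingLLM keeps a few initial "attention sink" tokens plus a recent-token window, while H2O additionally retains the tokens that have accumulated the most attention mass ("heavy hitters"). A minimal toy sketch of such a joint KV-cache retention policy is below; the function name, budgets, and scoring are illustrative assumptions, not the repo's actual implementation.

```python
def select_kept_tokens(attn_scores, n_sink=4, n_recent=8, n_heavy=2):
    """Toy KV-cache retention policy (illustrative, not the repo's code).

    attn_scores[i] = accumulated attention mass received by token i.
    Returns sorted indices of tokens to keep in the KV cache:
      - the first n_sink tokens (StreamingLLM attention sinks),
      - the last n_recent tokens (sliding window),
      - the n_heavy highest-scoring tokens in between (H2O heavy hitters).
    """
    n = len(attn_scores)
    keep = set(range(min(n_sink, n)))            # attention sinks
    keep |= set(range(max(0, n - n_recent), n))  # recent window
    # H2O step: among remaining middle tokens, keep the heaviest hitters
    middle = [i for i in range(n) if i not in keep]
    middle.sort(key=lambda i: attn_scores[i], reverse=True)
    keep |= set(middle[:n_heavy])
    return sorted(keep)
```

Everything outside the kept set is evicted, so cache size stays bounded regardless of context length, which is what makes 256k+ contexts feasible on a 16 GB card.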

After TheTom's great work yesterday showing Turboquant working in Llama.cpp, I added a few other things that bring some more complementary speedups to Llama.cpp. So far the CPU and CUDA builds are fully usable. I'm seeing full-speed token generation on my 16 GB 4060 Ti up to a 256k+ context window using Qwen 3.5 4B, which is pretty insane.

Check out DEEPDIVE.md for all the technical details and README_TURBOQUANT.md to get up and running.
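For orientation, a typical build-and-run might look like the following, assuming the fork keeps upstream llama.cpp's CMake layout and binary names (the model filename is a placeholder; defer to README_TURBOQUANT.md for the actual steps and any Turboquant-specific flags):

```shell
# Illustrative only -- check README_TURBOQUANT.md for the real instructions.
git clone https://github.com/peva3/turboquant-h2o-streamingllm
cd turboquant-h2o-streamingllm
cmake -B build -DGGML_CUDA=ON        # drop -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j
# Standard upstream llama.cpp flags: -c context size, -ngl GPU layers
./build/bin/llama-cli -m qwen3.5-4b.gguf -c 262144 -ngl 99 -p "Hello"
```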

If you have any questions or suggestions, please hit me up or post a GitHub issue.

https://github.com/peva3/turboquant-h2o-streamingllm

submitted by /u/peva3