ik_llama.cpp gives 26x faster prompt processing on Qwen 3.5 27B — real world numbers

Reddit r/LocalLLaMA / 3/22/2026


Key Points

  • ik_llama.cpp fork delivers dramatic speedups on Qwen 3.5 27B, achieving about 1,122 tokens/sec for prompt evaluation vs 43 tokens/sec on mainline llama.cpp (roughly 26x faster) and 26 tokens/sec for generation vs 7.5 tokens/sec (about 3.5x).
  • The speedup comes from fused Gated Delta Network kernels that run the entire computation on CUDA, reducing graph splits from 34 to 2 and largely removing CPU bottlenecks.
  • It’s a drop-in replacement for your existing llama-server with the same OpenAI-compatible API; Thireus provides pre-built Windows binaries with CUDA 12.8 and AVX512_VNNI support for the W-2295 system.
  • A known caveat: Qwen 3.5’s architecture still performs full prompt re-processing on every turn; the fork makes it tolerable but does not eliminate this behavior.

I've been running Qwen 3.5 27B Q4_K_M on a Blackwell RTX PRO 4000 (24GB) for agentic coding work and hit a wall with mainline llama.cpp. Switched to the ik_llama.cpp fork today and the difference is staggering. Posting real numbers in case it helps others.

Hardware

  • Lenovo ThinkStation P520, Xeon W-2295 (18-core), 128GB DDR4 ECC
  • NVIDIA RTX PRO 4000 Blackwell, 24GB GDDR7
  • Context: 131,072 tokens, KV cache q8_0/q4_0
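For reference, a minimal launch sketch matching the setup above. Flag names follow the mainline llama-server CLI (the fork keeps the same arguments); the GGUF filename is a placeholder, so adjust paths and values to your own system.

```shell
# Launch sketch for the configuration described above.
# The model filename is a placeholder, not the exact file used here.
llama-server \
  -m Qwen3.5-27B-Q4_K_M.gguf \
  -c 131072 \
  -ngl 99 \
  --cache-type-k q8_0 \
  --cache-type-v q4_0 \
  --port 1234
```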

Benchmark Results

| Metric | Mainline b8457 | ik_llama.cpp b4370 |
|---|---|---|
| Prompt eval | ~43 tok/sec | 1,122 tok/sec (26x) |
| Generation | ~7.5 tok/sec | 26 tok/sec (3.5x) |
| Graph splits | 34 | 2 |
| CPU during inference | All threads pegged | Idle |
| GPU prompt processing | Partial | 100% GPU |
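The headline multipliers follow directly from the raw throughput numbers in the table; a quick arithmetic check:

```shell
# Derive the speedup factors from the measured tokens/sec.
pp_speedup=$(awk 'BEGIN { printf "%.1f", 1122 / 43 }')   # prompt eval
tg_speedup=$(awk 'BEGIN { printf "%.1f", 26 / 7.5 }')    # generation
echo "prompt eval speedup: ${pp_speedup}x"   # ~26.1x
echo "generation speedup: ${tg_speedup}x"    # ~3.5x
```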

Why the Difference

Qwen 3.5 uses a hybrid Gated Delta Network / Mamba-style SSM architecture interleaved with standard attention. Mainline llama.cpp was splitting the compute graph 34 times, with significant CPU involvement on each split. ik_llama.cpp implements fused GDN kernels that handle the entire computation on CUDA, dropping graph splits from 34 to 2.

At startup with ik_llama.cpp you'll see:

    fused Gated Delta Net (autoregressive) enabled
    fused Gated Delta Net (chunked) enabled
    graph splits = 2

That's the key difference. The model weights didn't change. The server did.

The Full Re-Processing Bug

Qwen 3.5's recurrent architecture still forces full prompt re-processing on every turn when the prompt changes (tracked in llama.cpp issue #20225). At 1,122 tok/sec this is tolerable — what took several minutes now takes seconds. But it's still happening on every turn. Something to be aware of.

Where to Get It

Pre-built Windows CUDA 12.8 binaries with AVX512 VNNI are available from the Thireus fork:

https://github.com/Thireus/ik_llama.cpp/releases

It's a drop-in replacement for your existing llama-server folder. Same command line arguments, same OpenAI-compatible API on port 1234.
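Since the fork exposes the same OpenAI-compatible API, a quick smoke test after switching is a single curl call. This assumes the server is already running on localhost port 1234; adjust host and port to your setup.

```shell
# Smoke test of the OpenAI-compatible chat endpoint (assumes a server
# is already listening on localhost:1234).
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 16
      }'
```

If the response comes back as a normal chat completion JSON, existing clients pointed at the old server should work unchanged.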

For the W-2295 (AVX512 VNNI) grab: ik_llama-main-b4370-4d7223c-bin-win-cuda-12.8-x64-avx512_vnni.zip

Bottom Line

If you're running Qwen 3.5 on mainline llama.cpp and wondering why it's slow — this is why. The fused GDN kernels in ik_llama.cpp are not yet in mainline. Try the fork.

Happy to answer questions about the setup or benchmarking methodology.

submitted by /u/New-Inspection7034