I've been running Qwen 3.5 27B Q4_K_M on a Blackwell RTX PRO 4000 (24GB) for agentic coding work and hit a wall with mainline llama.cpp. Switched to the ik_llama.cpp fork today and the difference is staggering. Posting real numbers in case it helps others.
Hardware Lenovo ThinkStation P520, Xeon W-2295 18-core, 128GB DDR4 ECC NVIDIA RTX PRO 4000 Blackwell 24GB GDDR7 Context: 131,072 tokens, KV cache q8_0/q4_0
Benchmark Results
Metric Mainline b8457 ik_llama.cpp b4370 Prompt eval ~43 tok/sec 1,122 tok/sec (26x) Generation ~7.5 tok/sec 26 tok/sec (3.5x) Graph splits 34 2 CPU during inference All threads pegged Idle GPU prompt processing Partial 100% GPU
Why the Difference
Qwen 3.5 uses a hybrid Gated Delta Network / Mamba-style SSM architecture interleaved with standard attention. Mainline llama.cpp was splitting this across 34 graph nodes with significant CPU involvement. ik_llama.cpp implements fused GDN kernels that handle the entire computation on CUDA, dropping graph splits from 34 to 2.
At startup with ik_llama.cpp you'll see:
fused Gated Delta Net (autoregressive) enabled fused Gated Delta Net (chunked) enabled graph splits = 2
That's the key difference. The model weights didn't change. The server did.
The Full Re-Processing Bug
Qwen 3.5's recurrent architecture still forces full prompt re-processing on every turn when the prompt changes (tracked in llama.cpp issue #20225). At 1,122 tok/sec this is tolerable — what took several minutes now takes seconds. But it's still happening on every turn. Something to be aware of.
Where to Get It
Pre-built Windows CUDA 12.8 binaries with AVX512 VNNI are available from the Thireus fork:
https://github.com/Thireus/ik_llama.cpp/releases
It's a drop-in replacement for your existing llama-server folder. Same command line arguments, same OpenAI-compatible API on port 1234.
For the W-2295 (AVX512 VNNI) grab: ik_llama-main-b4370-4d7223c-bin-win-cuda-12.8-x64-avx512_vnni.zip
Bottom Line
If you're running Qwen 3.5 on mainline llama.cpp and wondering why it's slow — this is why. The fused GDN kernels in ik_llama.cpp are not yet in mainline. Try the fork.
Happy to answer questions about the setup or benchmarking methodology.
[link] [comments]


