Hey folks, I ran a series of benchmarks comparing ik_llama.cpp against the official llama.cpp across multiple Qwen3 and Qwen3.5 variants (including MoE architectures). The results showed some interesting performance flips depending on the model architecture and backend provider.
Hardware:
- CPU: Ryzen 9 5950x
- RAM: 64GB DDR4
- GPU: RTX 5070 Ti
1. Qwen3-Coder-Next (MoE)
All prompts were 22,568 tokens.
    llama-server --model ~/llm/models/unsloth/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
        --host 0.0.0.0 --port 8001 --ctx-size 100000 \
        --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
        --n-gpu-layers 999 -ot ".ffn_.*_exps.=CPU" \
        --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --api-key local-llm

Comparison across providers (unsloth, bartowski, ubergarm). The trend is consistent: ik_llama.cpp significantly outperforms llama.cpp on prompt processing.
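For context, the `-ot ".ffn_.*_exps.=CPU"` override pins every tensor whose name matches the regex to the CPU, which is how the MoE expert weights stay in system RAM while everything else goes to the GPU. A quick Python sketch of which tensor names the pattern catches (the tensor names below are illustrative examples of llama.cpp's GGUF naming scheme, not dumped from this model):

```python
import re

# The tensor-override pattern from the command above.
pattern = re.compile(r".ffn_.*_exps.")

# Illustrative tensor names (assumed for this demo):
names = [
    "blk.0.ffn_gate_exps.weight",  # MoE expert FFN -> matches, stays on CPU
    "blk.0.ffn_gate_inp.weight",   # expert router  -> no match, goes to GPU
    "blk.0.attn_q.weight",         # attention      -> no match, goes to GPU
]
for n in names:
    print(n, "-> CPU" if pattern.search(n) else "-> GPU")
```

Only the expert FFN tensors (the bulk of a MoE model's weights) match, so the GPU keeps attention, routers, and shared layers.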
| Model Provider | Quantization | Backend | Prompt Speed (t/s) | Gen Speed (t/s) |
|---|---|---|---|---|
| unsloth | Q4_K_XL | ik_llama.cpp | 451.28 | 33.68 |
| unsloth | Q4_K_XL | llama.cpp | 308.91 | 32.57 |
| unsloth | Q4_K_M | ik_llama.cpp | 454.73 | 33.72 |
| unsloth | Q4_K_M | llama.cpp | 312.34 | 32.53 |
| bartowski | Q4_K_L | ik_llama.cpp | 440.89 | 33.61 |
| bartowski | Q4_K_L | llama.cpp | 310.35 | 32.74 |
| ubergarm | Q4_0 | ik_llama.cpp | 423.68 | 33.97 |
| ubergarm | Q4_0 | llama.cpp | 317.45 | 33.03 |
Observation: ik_llama.cpp is consistently ~35-45% faster on prompt processing for Qwen3-Coder models. Generation speeds are nearly identical (within ~1 t/s).
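To make that claim concrete, here is a small Python sketch that recomputes the per-provider speedup from the prompt-speed numbers in the table above:

```python
# Prompt-processing speeds (t/s) copied from the Qwen3-Coder table:
# provider/quant -> (ik_llama.cpp, llama.cpp)
results = {
    "unsloth Q4_K_XL":  (451.28, 308.91),
    "unsloth Q4_K_M":   (454.73, 312.34),
    "bartowski Q4_K_L": (440.89, 310.35),
    "ubergarm Q4_0":    (423.68, 317.45),
}

def speedup_pct(ik, llama):
    """Percent by which ik_llama.cpp beats llama.cpp on prompt processing."""
    return (ik / llama - 1.0) * 100.0

for name, (ik, llama) in results.items():
    print(f"{name}: +{speedup_pct(ik, llama):.1f}%")
```

The individual speedups range from roughly +33% (ubergarm Q4_0) to +46% (unsloth Q4_K_XL), which is where the ~35-45% figure comes from.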
2. Qwen3.5-35B-A3B (MoE)
    llama-server -m ~/..../Qwen3.5-35B-A3B.gguf --host 0.0.0.0 --port 8001 \
        -c 180000 -ngl 999 --n-cpu-moe 24 -fa on -t 16 -b 2048 -ub 2048 \
        --no-mmap --jinja -ctk q8_0 -ctv q8_0 \
        --repeat-penalty 1.1 --repeat-last-n 64 --temp 0.7 --top-p 0.9 --min-p 0.05

Here the trend flips: llama.cpp handles the larger-context MoE workload better for prompt evaluation.
| Model Provider | Quantization | Backend | Prompt Speed (t/s) | Gen Speed (t/s) |
|---|---|---|---|---|
| ubergarm | Q4_0 | llama.cpp | 2,353.44 | 57.27 |
| ubergarm | Q4_0 | ik_llama.cpp | 1,801.37 | 58.89 |
| unsloth | Q4_K_XL | llama.cpp | 2,201.10 | 53.88 |
| unsloth | Q4_K_XL | ik_llama.cpp | 1,726.10 | 58.13 |
| AesSedai | Q4_K_M | llama.cpp | Failed to Load | N/A |
| AesSedai | Q4_K_M | ik_llama.cpp | 1,746.11 | 57.81 |
Observation: llama.cpp is ~25-30% faster on prompt processing for Qwen3.5-35B. However, ik_llama.cpp produced significantly longer generations in some runs and successfully loaded a GGUF (AesSedai Q4_K_M) that llama.cpp failed to process.
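Since llama.cpp refused to load one of the 35B GGUFs, a quick header sanity check can rule out a corrupt download before blaming the backend. A minimal sketch of a GGUF preamble reader (magic `GGUF`, uint32 version, then uint64 tensor and key/value counts, per the GGUF file-format spec; `path` is a placeholder for your model file):

```python
import struct

def read_gguf_header(path):
    """Read the fixed GGUF preamble: magic, version, tensor count, KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, = struct.unpack("<I", f.read(4))       # little-endian uint32
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))  # two uint64s
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}
```

If this raises, or reports a version your backend build doesn't support, the load failure is a file or version problem rather than an inference bug.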
3. Qwen3.5-9B (Distilled/Reasoning)
    llama-server -m ~/llm/models/mradermacher/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5-GGUF/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5.Q6_K.gguf \
        --host 0.0.0.0 --port 8001 -c 131072 -ngl 999 -fa on -t 8 -b 2048 -ub 2048 \
        --no-mmap --jinja -ctk q8_0 -ctv q8_0 \
        --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.0 --repeat-penalty 1.0

Smaller models show very high prompt speeds, but generation behavior differs significantly between the backends.
| Model Provider | Model (Quant) | Backend | Prompt Speed (t/s) | Gen Speed (t/s) |
|---|---|---|---|---|
| mradermacher | Crow-9B (Q6_K) | ik_llama.cpp | 4,149.83 | 73.18 |
| mradermacher | Crow-9B (Q6_K) | llama.cpp | 3,853.59 | 81.66 |
| mradermacher | Qwen3.5-9B (Q6_K) | llama.cpp | Parse Error | N/A |
| mradermacher | Qwen3.5-9B (Q6_K) | ik_llama.cpp | 4,146.30 | 77.36 |
Observation: ik_llama.cpp is slightly faster (~8%) on prompt processing for the 9B models. Crucially, on Crow-9B, ik_llama.cpp generated ~5,500 tokens vs. 588 for llama.cpp. This suggests ik_llama.cpp may handle chain-of-thought/reasoning tokens better or uses different stopping criteria. llama.cpp also failed to parse one of the 9B GGUFs.
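The token-count discrepancy is easy to miss if you only watch t/s. llama-server returns a `timings` object alongside each completion, which is where the numbers in these tables come from. A small sketch that reduces it to the three figures reported here (field names as documented in llama.cpp's server README; the sample dict is illustrative, not a real measurement):

```python
def summarize_timings(timings):
    """Reduce llama-server's `timings` object to prompt t/s, gen t/s, gen tokens."""
    return {
        "prompt_tps": timings["prompt_per_second"],
        "gen_tps": timings["predicted_per_second"],
        "gen_tokens": timings["predicted_n"],
    }

# Illustrative response fragment (shape matches llama-server; values made up):
sample = {
    "prompt_per_second": 4149.83,
    "predicted_per_second": 73.18,
    "predicted_n": 5520,
}
print(summarize_timings(sample))
```

Logging `gen_tokens` per request alongside the speeds is what surfaced the 5,520 vs. 588 gap in the first place.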
Analysis & Conclusion
1. The Performance Flip
The performance advantage flips depending on the model architecture:
- Qwen3-Coder (22k-token prompts): `ik_llama.cpp` dominates prompt processing (~450 t/s vs ~310 t/s).
- Qwen3.5-35B (180k context): `llama.cpp` dominates prompt processing (~2,300 t/s vs ~1,750 t/s).
- Qwen3.5-9B: both are comparable, with `ik_llama.cpp` slightly faster (~4,150 t/s vs ~3,850 t/s).
2. Generation Stability
Generation speeds (tokens/s) are generally consistent between backends, within ~5% variance. However, ik_llama.cpp produced much longer reasoning outputs on the 9B models without crashing, whereas llama.cpp sometimes halted generation early (588 vs. 5,520 tokens on Crow-9B).
3. Compatibility & Provider Optimization
- GGUF Stability: `ik_llama.cpp` showed better stability with specific GGUF variants from certain sources (e.g., AesSedai 35B, mradermacher 9B), whereas `llama.cpp` hit load failures and parse errors on the same files.
- Ubergarm Note: interestingly, ubergarm positions their models as optimized for `ik_llama.cpp`, but the results show that isn't always the case for prompt processing. On Qwen3.5-35B-A3B Q4_0, `llama.cpp` was ~30% faster on prompt tokens despite that positioning.
Recommendation:
- Use `ik_llama.cpp` for Qwen3-Coder: prompt processing is ~35-45% faster, which is a game changer in my case when using the model with Claude Code.
- Use `llama.cpp` for Qwen3.5-35B models (better prompt throughput).
- Monitor generation length carefully, as backend differences can change reasoning token counts significantly.
Questions:
- Has anyone else encountered this performance flip between `ik_llama.cpp` and `llama.cpp` on MoE models?
- Did I mess up the launch parameters? Are there backend-specific flags needed for a fair comparison (e.g., ik-specific MoE tweaks)?