AI Navigate

Benchmark: ik_llama.cpp vs llama.cpp on Qwen3/3.5 MoE Models

Reddit r/LocalLLaMA / 3/15/2026

📰 NewsTools & Practical UsageModels & Research

Key Points

  • Benchmark compares ik_llama.cpp and llama.cpp on Qwen3/Qwen3.5 MoE models using Ryzen 9 5950x, 64GB RAM, and RTX 5070 Ti.
  • Across provider/quantization combos (unsloth Q4_K_XL, unsloth Q4_K_M, bartowski Q4_K_L, ubbergarm Q4_0), ik_llama.cpp achieves higher prompt speeds while maintaining similar generation speeds.
  • The observed prompt speeds with ik_llama.cpp range roughly from 423 to 455 t/s, versus llama.cpp roughly 309 to 317 t/s, with generation speeds around 33.6 to 33.97 t/s for both.
  • The article notes a consistent 35-40% uplift in prompt processing for ik_llama.cpp across tested configurations, indicating meaningful performance gains for prompt-heavy workloads.

Hey folks, I ran a series of benchmarks comparing ik_llama.cpp against the official llama.cpp across multiple Qwen3 and Qwen3.5 variants (including MoE architectures). The results showed some interesting performance flips depending on the model architecture and backend provider.

Hardware:

  • CPU: Ryzen 9 5950x
  • RAM: 64GB DDR4
  • GPU: RTX 5070 Ti

1. Qwen3-Coder-Next (MoE) All prompts were 22,568 tokens

llama-server --model ~/llm/models/unsloth/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-UD-Q4_K_XL.gguf --host 0.0.0.0 --port 8001 --ctx-size 100000 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --n-gpu-layers 999 -ot ".ffn_.*_exps.=CPU" --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --api-key local-llm 

Comparison across providers (unsloth, bartowski, ubergarm). The trend is consistent: ik_llama significantly outperforms llama.cpp on prompt processing.

Model Provider Quantization Backend Prompt Speed (t/s) Gen Speed (t/s)
unsloth Q4_K_XL ik_llama.cpp 451.28 33.68
llama.cpp 308.91 32.57
unsloth Q4_K_M ik_llama.cpp 454.73 33.72
llama.cpp 312.34 32.53
bartowski Q4_K_L ik_llama.cpp 440.89 33.61
llama.cpp 310.35 32.74
ubergarm Q4_0 ik_llama.cpp 423.68 33.97
llama.cpp 317.45 33.03

Observation: ik_llama.cpp is consistently ~35-40% faster on prompt processing for Qwen3-Coder models. Generation speeds are nearly identical.

2. Qwen3.5-35B-A3B (MoE)

llama-server -m ~/..../Qwen3.5-35B-A3B.gguf --host 0.0.0.0 --port 8001 -c 180000 -ngl 999 --n-cpu-moe 24 -fa on -t 16 -b 2048 -ub 2048 --no-mmap --jinja -ctk q8_0 -ctv q8_0 --repeat-penalty 1.1 --repeat-last-n 64 --temp 0.7 --top-p 0.9 --min-p 0.05 

Here the trend flips. llama.cpp handles the larger MoE context better for prompt evaluation.

Model Provider Quantization Backend Prompt Speed (t/s) Gen Speed (t/s)
ubergarm Q4_0 llama.cpp 2,353.44 57.27
ik_llama.cpp 1,801.37 58.89
unsloth Q4_K_XL llama.cpp 2,201.10 53.88
ik_llama.cpp 1,726.10 58.13
AesSedai Q4_K_M llama.cpp Failed to Load N/A
ik_llama.cpp 1,746.11 57.81

Observation: llama.cpp is ~20-30% faster on prompt processing for Qwen3.5-35B. However, ik_llama generated significantly more tokens in some runs (higher generation output) and successfully loaded GGUFs that llama.cpp failed to process.

3. Qwen3.5-9B (Distilled/Reasoning)

llama-server -m ~/llm/models/mradermacher/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5-GGUF/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5.Q6_K.gguf --host 0.0.0.0 --port 8001 -c 131072 -ngl 999 -fa on -t 8 -b 2048 -ub 2048 --no-mmap --jinja -ctk q8_0 -ctv q8_0 --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.0 --repeat-penalty 1.0 

Small MoE models show high prompt speeds, but generation behavior differs significantly.

Model Provider Quantization Backend Prompt Speed (t/s) Gen Speed (t/s)
mradermacher Crow-9B (Q6_K) ik_llama.cpp 4,149.83 73.18
llama.cpp 3,853.59 81.66
mradermacher Qwen3.5-9B (Q6_K) llama.cpp Parse Error N/A
ik_llama.cpp 4,146.30 77.36

Observation: ik_llama.cpp is faster on prompt processing for 9B models. Crucially, on the Crow-9B model, ik_llama generated ~5,500 tokens vs 588 tokens for llama.cpp. This suggests ik_llama may be better at handling Chain-of-Thought/Reasoning tokens or has different stopping criteria. llama.cpp also failed to parse one of the 9B GGUFs.

Analysis & Conclusion

1. The Performance Flip The performance advantage flips depending on the model architecture:

  • Qwen3-Coder (22k): ik_llama.cpp dominates prompt processing (~450 t/s vs ~310 t/s).
  • Qwen3.5-35B (180k): llama.cpp dominates prompt processing (~2300 t/s vs ~1750 t/s).
  • Qwen3.5-9B: Both are comparable, with ik_llama slightly faster (~4150 t/s vs ~3850 t/s).

2. Generation Stability Generation speeds (tokens/s) are generally consistent between backends within ~5% variance. However, ik_llama.cpp appears to produce longer reasoning outputs on 9B models without crashing, whereas llama.cpp sometimes halted generation early (588 tokens vs 5520 tokens on Crow-9B).

3. Compatibility & Provider Optimization

  • GGUF Stability: ik_llama.cpp showed better stability with specific GGUF variants from certain sources (e.g., AesSedai 35B, MRadermacher 9B), whereas llama.cpp encountered load failures and parse errors on the same files.
  • Ubergarm Note: Interestingly, ubergarm positions their models as being optimized for ik_llama, but the test results show that isn't always the case for prompt processing. For example, on the Qwen3.5-35B-A3B-Q4_0 model, llama.cpp was ~30% faster on prompt tokens than ik_llama, despite the model's positioning.

Recommendation:

  • Use ik_llama.cpp for Qwen3-Coder Prompt Processing 50% faster - it's game changer in my case to use model with claude code
  • Use llama.cpp for Qwen3.5-35B models (better prompt throughput).
  • Monitor generation length carefully, as backend differences may affect reasoning token counts significantly.

Questions:

  • Has anyone encountered this performance flip between ik_llama.cpp and llama.cpp on MoE models?
  • Did I mess up the launch parameters? Are there backend-specific flags I need for fair comparison (e.g., ik-specific MoE tweaks)?
submitted by /u/Fast_Thing_7949
[link] [comments]