Hey folks, I ran a series of benchmarks comparing ik_llama.cpp against the official llama.cpp across multiple Qwen3 and Qwen3.5 variants (including MoE architectures). The results showed some interesting performance flips depending on the model architecture and backend provider.
Hardware:
- CPU: Ryzen 9 5950x
- RAM: 64GB DDR4
- GPU: RTX 5070 Ti
1. Qwen3-Coder-Next (MoE)
All prompts were 22,568 tokens.
    llama-server --model ~/llm/models/unsloth/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
        --host 0.0.0.0 --port 8001 --ctx-size 100000 \
        --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
        --n-gpu-layers 999 -ot ".ffn_.*_exps.=CPU" \
        --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --api-key local-llm

Comparison across providers (unsloth, bartowski, ubergarm). The trend is consistent: ik_llama.cpp significantly outperforms llama.cpp on prompt processing.
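For context, the `-ot ".ffn_.*_exps.=CPU"` override pins every tensor whose name matches the regex to the CPU, which is how the MoE expert weights stay in system RAM while everything else goes to the GPU. A quick Python sketch of which tensor names the pattern catches (the tensor names below are illustrative examples of llama.cpp's GGUF naming scheme, not dumped from this model):

```python
import re

# The tensor-override pattern from the command above.
pattern = re.compile(r".ffn_.*_exps.")

# Illustrative tensor names (assumed for this demo):
names = [
    "blk.0.ffn_gate_exps.weight",  # MoE expert FFN -> matches, stays on CPU
    "blk.0.ffn_gate_inp.weight",   # expert router  -> no match, goes to GPU
    "blk.0.attn_q.weight",         # attention      -> no match, goes to GPU
]
for n in names:
    print(n, "-> CPU" if pattern.search(n) else "-> GPU")
```

Only the expert FFN tensors (the bulk of a MoE model's weights) match, so the GPU keeps attention, routers, and shared layers.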
| Model Provider | Quantization | Backend | Prompt Speed (t/s) | Gen Speed (t/s) |
|---|---|---|---|---|
| unsloth | Q4_K_XL | ik_llama.cpp | 451.28 | 33.68 |
| unsloth | Q4_K_XL | llama.cpp | 308.91 | 32.57 |
| unsloth | Q4_K_M | ik_llama.cpp | 454.73 | 33.72 |
| unsloth | Q4_K_M | llama.cpp | 312.34 | 32.53 |
| bartowski | Q4_K_L | ik_llama.cpp | 440.89 | 33.61 |
| bartowski | Q4_K_L | llama.cpp | 310.35 | 32.74 |
| ubergarm | Q4_0 | ik_llama.cpp | 423.68 | 33.97 |
| ubergarm | Q4_0 | llama.cpp | 317.45 | 33.03 |
Observation: ik_llama.cpp is consistently ~35-45% faster on prompt processing for Qwen3-Coder models. Generation speeds are nearly identical (within ~1 t/s).
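To make that claim concrete, here is a small Python sketch that recomputes the per-provider speedup from the prompt-speed numbers in the table above:

```python
# Prompt-processing speeds (t/s) copied from the Qwen3-Coder table:
# provider/quant -> (ik_llama.cpp, llama.cpp)
results = {
    "unsloth Q4_K_XL":  (451.28, 308.91),
    "unsloth Q4_K_M":   (454.73, 312.34),
    "bartowski Q4_K_L": (440.89, 310.35),
    "ubergarm Q4_0":    (423.68, 317.45),
}

def speedup_pct(ik, llama):
    """Percent by which ik_llama.cpp beats llama.cpp on prompt processing."""
    return (ik / llama - 1.0) * 100.0

for name, (ik, llama) in results.items():
    print(f"{name}: +{speedup_pct(ik, llama):.1f}%")
```

The individual speedups range from roughly +33% (ubergarm Q4_0) to +46% (unsloth Q4_K_XL), which is where the ~35-45% figure comes from.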
2. Qwen3.5-35B-A3B (MoE)
    llama-server -m ~/..../Qwen3.5-35B-A3B.gguf --host 0.0.0.0 --port 8001 \
        -c 180000 -ngl 999 --n-cpu-moe 24 -fa on -t 16 -b 2048 -ub 2048 \
        --no-mmap --jinja -ctk q8_0 -ctv q8_0 \
        --repeat-penalty 1.1 --repeat-last-n 64 --temp 0.7 --top-p 0.9 --min-p 0.05

Here the trend flips: llama.cpp handles the larger-context MoE workload better for prompt evaluation.
| Model Provider | Quantization | Backend | Prompt Speed (t/s) | Gen Speed (t/s) |
|---|---|---|---|---|
| ubergarm | Q4_0 | llama.cpp | 2,353.44 | 57.27 |
| ubergarm | Q4_0 | ik_llama.cpp | 1,801.37 | 58.89 |
| unsloth | Q4_K_XL | llama.cpp | 2,201.10 | 53.88 |
| unsloth | Q4_K_XL | ik_llama.cpp | 1,726.10 | 58.13 |
| AesSedai | Q4_K_M | llama.cpp | Failed to Load | N/A |
| AesSedai | Q4_K_M | ik_llama.cpp | 1,746.11 | 57.81 |
Observation: llama.cpp is ~25-30% faster on prompt processing for Qwen3.5-35B. However, ik_llama.cpp produced significantly longer generations in some runs and successfully loaded a GGUF (AesSedai Q4_K_M) that llama.cpp failed to process.
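Since llama.cpp refused to load one of the 35B GGUFs, a quick header sanity check can rule out a corrupt download before blaming the backend. A minimal sketch of a GGUF preamble reader (magic `GGUF`, uint32 version, then uint64 tensor and key/value counts, per the GGUF file-format spec; `path` is a placeholder for your model file):

```python
import struct

def read_gguf_header(path):
    """Read the fixed GGUF preamble: magic, version, tensor count, KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, = struct.unpack("<I", f.read(4))       # little-endian uint32
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))  # two uint64s
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}
```

If this raises, or reports a version your backend build doesn't support, the load failure is a file or version problem rather than an inference bug.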
3. Qwen3.5-9B (Distilled/Reasoning)
    llama-server -m ~/llm/models/mradermacher/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5-GGUF/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5.Q6_K.gguf \
        --host 0.0.0.0 --port 8001 -c 131072 -ngl 999 -fa on -t 8 -b 2048 -ub 2048 \
        --no-mmap --jinja -ctk q8_0 -ctv q8_0 \
        --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.0 --repeat-penalty 1.0

Smaller models show very high prompt speeds, but generation behavior differs significantly between the backends.
| Model Provider | Model (Quant) | Backend | Prompt Speed (t/s) | Gen Speed (t/s) |
|---|---|---|---|---|
| mradermacher | Crow-9B (Q6_K) | ik_llama.cpp | 4,149.83 | 73.18 |
| mradermacher | Crow-9B (Q6_K) | llama.cpp | 3,853.59 | 81.66 |
| mradermacher | Qwen3.5-9B (Q6_K) | llama.cpp | Parse Error | N/A |
| mradermacher | Qwen3.5-9B (Q6_K) | ik_llama.cpp | 4,146.30 | 77.36 |
Observation: ik_llama.cpp is slightly faster (~8%) on prompt processing for the 9B models. Crucially, on Crow-9B, ik_llama.cpp generated ~5,500 tokens vs. 588 for llama.cpp. This suggests ik_llama.cpp may handle chain-of-thought/reasoning tokens better or uses different stopping criteria. llama.cpp also failed to parse one of the 9B GGUFs.
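The token-count discrepancy is easy to miss if you only watch t/s. llama-server returns a `timings` object alongside each completion, which is where the numbers in these tables come from. A small sketch that reduces it to the three figures reported here (field names as documented in llama.cpp's server README; the sample dict is illustrative, not a real measurement):

```python
def summarize_timings(timings):
    """Reduce llama-server's `timings` object to prompt t/s, gen t/s, gen tokens."""
    return {
        "prompt_tps": timings["prompt_per_second"],
        "gen_tps": timings["predicted_per_second"],
        "gen_tokens": timings["predicted_n"],
    }

# Illustrative response fragment (shape matches llama-server; values made up):
sample = {
    "prompt_per_second": 4149.83,
    "predicted_per_second": 73.18,
    "predicted_n": 5520,
}
print(summarize_timings(sample))
```

Logging `gen_tokens` per request alongside the speeds is what surfaced the 5,520 vs. 588 gap in the first place.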
Analysis & Conclusion
1. The Performance Flip
The performance advantage flips depending on the model architecture:
- Qwen3-Coder (22k-token prompts): `ik_llama.cpp` dominates prompt processing (~450 t/s vs ~310 t/s).
- Qwen3.5-35B (180k context): `llama.cpp` dominates prompt processing (~2,300 t/s vs ~1,750 t/s).
- Qwen3.5-9B: both are comparable, with `ik_llama.cpp` slightly faster (~4,150 t/s vs ~3,850 t/s).
2. Generation Stability
Generation speeds (tokens/s) are generally consistent between backends, within ~5% variance. However, ik_llama.cpp produced much longer reasoning outputs on the 9B models without crashing, whereas llama.cpp sometimes halted generation early (588 vs. 5,520 tokens on Crow-9B).
3. Compatibility & Provider Optimization
- GGUF Stability: `ik_llama.cpp` showed better stability with specific GGUF variants from certain sources (e.g., AesSedai 35B, mradermacher 9B), whereas `llama.cpp` hit load failures and parse errors on the same files.
- Ubergarm Note: interestingly, ubergarm positions their models as optimized for `ik_llama.cpp`, but the results show that isn't always the case for prompt processing. On Qwen3.5-35B-A3B Q4_0, `llama.cpp` was ~30% faster on prompt tokens despite that positioning.
Recommendation:
- Use `ik_llama.cpp` for Qwen3-Coder: prompt processing is ~35-45% faster, which is a game changer in my case when using the model with Claude Code.
- Use `llama.cpp` for Qwen3.5-35B models (better prompt throughput).
- Monitor generation length carefully, as backend differences can change reasoning token counts significantly.
Questions:
- Has anyone else encountered this performance flip between `ik_llama.cpp` and `llama.cpp` on MoE models?
- Did I mess up the launch parameters? Are there backend-specific flags needed for a fair comparison (e.g., ik-specific MoE tweaks)?