I tried local models a couple of weeks ago.
At the beginning I tried Ollama, but Reddit said it's better to switch to llama.cpp.
So I switched to the llama.cpp prebuilt binaries, and it was amazing. I was very happy with llama.cpp: speed almost doubled running Qwen3.5 9 Q8_K_M and Qwen3.5 35B-A3B Q4_K_M.
This week, ChatGPT and Gemini suggested I build llama.cpp from source on my PC to get maximum optimization.
I did it, and the result made me happy again: almost a 10% improvement.
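A minimal sketch of the from-source build, assuming the CUDA toolkit and CMake are installed (these are the standard llama.cpp CMake options, but check the repo's build docs for your version):

```shell
# Clone llama.cpp and build it from source.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# -DGGML_CUDA=ON enables the CUDA backend for the GPU;
# building locally lets the CPU kernels be tuned for the host machine,
# which is where the extra ~10% over generic prebuilt binaries can come from.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```

The binaries (llama-bench, llama-cli, llama-server) end up under `build/bin`.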
HW:
CPU: AMD 9700x
GPU: 5060 Ti 16GB
RAM: 16GB *2
Here's the result:
It's confusing to see `qwen35moe 35B.A3B Q5_K - Medium` in the bench output; shouldn't it be qwen36moe? The model was downloaded from unsloth/Qwen3.6-35B-A3B-GGUF · Hugging Face.
.\llama-bench.exe -m models\Qwen3.6-35B-A3B-UD-Q5_K_M.gguf -ngl 99 --n-cpu-moe 22 -d 131072 -p 512 -n 128 --cache-type-k q8_0 --cache-type-v q8_0 -fa 1 -mmp 0
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16310 MiB):
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 16310 MiB
| model | size | params | backend | ngl | n_cpu_moe | type_k | type_v | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -----: | -----: | -: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q5_K - Medium | 24.63 GiB | 34.66 B | CUDA | 99 | 22 | q8_0 | q8_0 | 1 | 0 | pp512 @ d131072 | 628.10 ± 2.80 |
| qwen35moe 35B.A3B Q5_K - Medium | 24.63 GiB | 34.66 B | CUDA | 99 | 22 | q8_0 | q8_0 | 1 | 0 | tg128 @ d131072 | 32.56 ± 0.32 |
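For actual chat use, the same offload settings carry over to llama-server; a sketch assuming the same model path (exact flag syntax, e.g. for flash attention, varies between llama.cpp versions):

```shell
# Serve the model with the same split as the bench: all layers on GPU
# (-ngl 99) but the MoE expert weights of 22 layers kept in CPU RAM
# (--n-cpu-moe 22), with a q8_0 KV cache to fit long context in 16 GB VRAM.
.\llama-server.exe -m models\Qwen3.6-35B-A3B-UD-Q5_K_M.gguf -ngl 99 --n-cpu-moe 22 --cache-type-k q8_0 --cache-type-v q8_0 -fa 1 -c 32768 --port 8080
```

Then point any OpenAI-compatible client at http://localhost:8080.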




