What I got with a 5060 Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M

Reddit r/LocalLLaMA / 4/18/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • A user running local LLMs on an AMD 9700X with an RTX 5060 Ti 16GB reports moving from Ollama to llama.cpp, seeing nearly double the speed for Qwen3.5 9B (Q8_K_M) and 35B (Q4_K_M).
  • They later followed advice from ChatGPT/Gemini to build llama.cpp locally for maximum optimization and claim an additional ~10% performance improvement.
  • Benchmarks using llama-bench show Qwen3.6-35B-A3B-UD-Q5_K_M runs on CUDA with 99 GPU layers and 22 CPU MoE layers, achieving high throughput (e.g., ~628 t/s for pp512 @ d131072 and ~32.56 t/s for tg128 @ d131072).
  • The results also reveal some confusion about the model name llama-bench reports (qwen35moe rather than the expected qwen36moe) despite the GGUF being downloaded from a Qwen3.6 repository on Hugging Face.
  • Overall, the post emphasizes that tooling choice (Ollama vs llama.cpp) and building/optimizing binaries locally can meaningfully affect local inference speed on limited-VRAM setups.

I tried local models a couple weeks ago.

At the beginning, I tried Ollama, but Reddit said it's better to switch to llama.cpp.

Then I switched to a llama.cpp prebuilt binary, and it was amazing. I was very happy with llama.cpp: speed almost doubled running Qwen3.5 9B Q8_K_M and Qwen3.5 35B-A3B Q4_K_M.

This week, ChatGPT and Gemini suggested I build llama.cpp on my own PC to get maximum optimization.

I did it, and the result made me happy again: almost a 10% improvement.
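For reference, a local CUDA build of llama.cpp usually looks something like the sketch below. The exact flags depend on your toolchain; `GGML_NATIVE=ON` is what tunes the binary to your own CPU (the "max optimization" the post is after), and the repository URL/flags here are the commonly documented ones, not taken from the post itself.

```shell
# Hedged sketch of a from-source CUDA build of llama.cpp.
# Requires git, CMake, and the CUDA toolkit to be installed.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# GGML_CUDA=ON enables the CUDA backend; GGML_NATIVE=ON compiles
# with optimizations for the host CPU (e.g. -march=native).
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```

The built binaries (llama-bench, llama-server, etc.) end up under `build/bin`.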

HW:

CPU: AMD 9700X

GPU: 5060 Ti 16GB

RAM: 16GB *2

Here's the result:

It's confusing to see "qwen35moe 35B.A3B Q5_K - Medium" in the output. Shouldn't it be qwen36moe? The model was downloaded from unsloth/Qwen3.6-35B-A3B-GGUF · Hugging Face.

.\llama-bench.exe -m models\Qwen3.6-35B-A3B-UD-Q5_K_M.gguf -ngl 99 --n-cpu-moe 22 -d 131072 -p 512 -n 128 --cache-type-k q8_0 --cache-type-v q8_0 -fa 1 -mmp 0

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16310 MiB):

Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 16310 MiB

| model                           |      size |  params | backend | ngl | n_cpu_moe | type_k | type_v | fa | mmap |            test |           t/s |
| ------------------------------- | --------: | ------: | ------- | --: | --------: | -----: | -----: | -: | ---: | --------------: | ------------: |
| qwen35moe 35B.A3B Q5_K - Medium | 24.63 GiB | 34.66 B | CUDA    |  99 |        22 |   q8_0 |   q8_0 |  1 |    0 | pp512 @ d131072 | 628.10 ± 2.80 |
| qwen35moe 35B.A3B Q5_K - Medium | 24.63 GiB | 34.66 B | CUDA    |  99 |        22 |   q8_0 |   q8_0 |  1 |    0 | tg128 @ d131072 |  32.56 ± 0.32 |
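The same offload settings from the benchmark can be reused to actually serve the model. The command below is a hypothetical llama-server invocation mirroring the llama-bench flags above (context size and port are placeholders, and the exact spelling of some options, like the flash-attention switch, varies between llama.cpp versions, so check `llama-server --help` on your build):

```shell
# Hypothetical: serve the benchmarked model with matching offload settings.
# -ngl 99          -> offload all layers to the GPU where possible
# --n-cpu-moe 22   -> keep the expert (MoE) weights of 22 layers on the CPU
# --cache-type-*   -> q8_0-quantized KV cache, as in the benchmark
# --no-mmap        -> matches -mmp 0 from llama-bench
.\llama-server.exe -m models\Qwen3.6-35B-A3B-UD-Q5_K_M.gguf -ngl 99 --n-cpu-moe 22 --cache-type-k q8_0 --cache-type-v q8_0 -fa on --no-mmap -c 32768 --port 8080
```

Once running, it exposes an OpenAI-compatible API at `http://localhost:8080`.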

submitted by /u/AdMinimum8193