
RTX 5060 Ti 16GB Local LLM Findings: 30B Still Wins, 35B UD Is Surprisingly Fast

Reddit r/LocalLLaMA / 3/21/2026

💬 Opinion · Tools & Practical Usage

Key Points

  • The post documents practical findings for running local LLMs on an RTX 5060 Ti 16 GB with 32 GB RAM using llama.cpp/llama-server, focusing on which model paths work best rather than raw benchmarks.
  • The surprising takeaway is that the strongest real-world picks were not the smallest or heaviest options, with the 30B coder profile and the 35B UD-Q2_K_XL path outperforming alternatives on this hardware.
  • The author provides concrete size/quant benchmarks for several models (e.g., 88 tok/s for a 4B model, 76–80 tok/s for 30B UD-Q3_K_XL and 35B UD-Q2_K_XL), illustrating practical tradeoffs across models.
  • Practical recommendations are given: default coding model is Unsloth Qwen3-Coder-30B UD-Q3_K_XL; best higher-context coding is Unsloth 30B at 96k; best fast 35B is Unsloth Qwen3.5-35B UD-Q2_K_XL; 35B Q4_K_M is not the right default on this card; Windows vs Ubuntu results are similar but show slight differences.

My first post here, since I benefit a lot from reading. I bought a 5060 Ti 16 GB and tried various models.

This is the short version of how I decided what to run on this card with llama.cpp, not a giant benchmark dump.

Machine:

  • RTX 5060 Ti 16 GB
  • DDR4 now at 32 GB
  • llama-server b8373 (46dba9fce)

Relevant launch settings:

  • fast path: fa=on, ngl=auto, threads=8
  • KV: -ctk q8_0 -ctv q8_0
  • 30B coder path: jinja, reasoning-budget 0, reasoning-format none
  • 35B UD path: c=262144, n-cpu-moe=8
  • 35B Q4_K_M stable tune: -ngl 26 -c 131072 --fit on --fit-ctx 131072 --fit-target 512M
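Putting the fast path and the 30B coder flags together, the launch looks roughly like this (the model filename, context size, and port are placeholders I'm filling in for illustration; the flags are spelled the way current llama-server accepts them):

```shell
# Sketch of the "fast path" + 30B coder launch from the flags above.
# Model path is a placeholder; adjust -c to taste (I ran this model up to 96k).
llama-server \
  -m ./Qwen3-Coder-30B-UD-Q3_K_XL.gguf \
  -fa on -t 8 \
  -ctk q8_0 -ctv q8_0 \
  --jinja --reasoning-budget 0 --reasoning-format none \
  -c 32768 --port 8080
```

I left `-ngl` off here since recent builds pick the GPU layer count automatically, which is what "ngl=auto" above means.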

Short version:

  • Best default coding model: Unsloth Qwen3-Coder-30B UD-Q3_K_XL
  • Best higher-context coding option: the same Unsloth 30B model at 96k
  • Best fast 35B coding option: Unsloth Qwen3.5-35B UD-Q2_K_XL
  • Unsloth Qwen3.5-35B Q4_K_M is interesting, but still not the right default on this card

What surprised me most is that the practical winners here were not just “smaller is faster”. On this machine, the strongest real-world picks were still the 30B coder profile and the older 35B UD-Q2_K_XL path, not the smaller 9B route and not the heavier 35B Q4_K_M experiment.

Quick size / quant snapshot from the local data:

  • Jackrong Qwen 3.5 4B Q5_K_M: 88 tok/s
  • LuffyTheFox Qwen 3.5 9B Q4_K_M: 64 tok/s
  • Jackrong Qwen 3.5 27B Q3_K_S: ~20 tok/s
  • Unsloth Qwen 3.0 30B UD-Q3_K_XL: 76.3 tok/s
  • Unsloth Qwen 3.5 35B UD-Q2_K_XL: 80.1 tok/s

Matched Windows vs Ubuntu shortlist test:

  • same 20 questions
  • same 32k context
  • same max_tokens=800
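For reference, a single matched request looks like this, assuming the OpenAI-compatible endpoint that llama-server exposes (the prompt is just a stand-in for one of the 20 questions, and the port matches the default launch above):

```shell
# One request from the shortlist run against a local llama-server.
# The actual 20 questions aren't included here; this is the request shape only.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
        "max_tokens": 800
      }'
```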

Results:

  • Unsloth Qwen3-Coder-30B UD-Q3_K_XL
    • Windows: 79.5 tok/s, quality 7.94
    • Ubuntu: 76.3 tok/s, quality 8.14
  • Unsloth Qwen3.5-35B UD-Q2_K_XL
    • Windows: 72.3 tok/s, quality 7.40
    • Ubuntu: 80.1 tok/s, quality 7.39
  • Jackrong Qwen3.5-27B Claude-Opus Distilled Q3_K_S
    • Windows: 19.9 tok/s, quality 8.85
    • Ubuntu: ~20.0 tok/s, quality 8.21

That left the picture pretty clean:

  • Unsloth Qwen 3.0 30B is still the safest main recommendation
  • Unsloth Qwen 3.5 35B UD-Q2_K_XL is still the only 35B option here that actually feels fast
  • Jackrong Qwen 3.5 27B stays in the slower quality-first tier

The 35B Q4_K_M result is the main cautionary note.

I was able to make Unsloth Qwen3.5-35B-A3B Q4_K_M stable on this card with:

  • -ngl 26
  • -c 131072
  • -ctk q8_0 -ctv q8_0
  • --fit on --fit-ctx 131072 --fit-target 512M
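Assembled into one command (the model filename is a placeholder; everything else is exactly the tune listed above):

```shell
# The stable 35B Q4_K_M tune on this card, as one invocation.
llama-server \
  -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -ngl 26 -c 131072 \
  -ctk q8_0 -ctv q8_0 \
  --fit on --fit-ctx 131072 --fit-target 512M
```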

But even with that tuning, it still did not beat the older Unsloth UD-Q2_K_XL path in practical use.

I also rechecked whether llama.cpp defaults were causing the odd Ubuntu result on Jackrong 27B. They were not.

Focused sweep on Ubuntu:

  • -fa on, auto parallel: 19.95 tok/s
  • -fa auto, auto parallel: 19.56 tok/s
  • -fa on, --parallel 1: 19.26 tok/s

So for that model:

  • flash-attn on vs auto barely changed anything
  • auto server parallel vs parallel=1 barely changed anything


Bottom line:

  • Unsloth 30B coder is still the best practical recommendation for a 5060 Ti 16 GB
  • Unsloth 30B @ 96k is the upgrade path if you need more context
  • Unsloth 35B UD-Q2_K_XL is still the fast 35B coding option
  • Unsloth 35B Q4_K_M is useful to experiment with, but I would not daily-drive it on this hardware
submitted by /u/Imaginary-Anywhere23