RTX 5060 Ti 16GB Local LLM Findings: 30B Still Wins, 35B UD Is Surprisingly Fast
Reddit r/LocalLLaMA / 3/21/2026
💬 Opinion · Tools & Practical Usage

My first post here, since I benefit a lot from reading. I bought a 5060 Ti 16 GB and tried various models. This is the short version of how I decided what to run on this card.
Machine:
Relevant launch settings:
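A representative llama-server launch for a 16 GB card. The values below (model file, context size, offload count) are illustrative assumptions, not the exact flags from the post:

```bash
# Illustrative launch for a 16 GB card -- model path and values are
# assumptions; check `llama-server --help` on your build for exact flags.
llama-server \
  -m Qwen3-Coder-30B-A3B-Instruct-UD-Q3_K_XL.gguf \
  -c 32768 \
  -ngl 99 \
  --host 127.0.0.1 \
  --port 8080
# -m: GGUF model file, -c: context window, -ngl: layers offloaded to GPU
```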
Short version:
What surprised me most is that the practical winners here were not just “smaller is faster”: on this machine, the strongest real-world picks were still the 30B coder profile and the 35B UD-Q2_K_XL path.
Quick size / quant snapshot from the local data:
Matched Windows vs Ubuntu shortlist test:
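A matched test like this is easiest with llama-bench: the identical command, binary version, and GGUF file on both systems. The token counts and repetition count here are illustrative assumptions:

```bash
# Run this exact command on both Windows and Ubuntu against the same file.
# -p/-n token counts and -r repetitions are assumptions, not the author's.
llama-bench \
  -m Qwen3-Coder-30B-A3B-Instruct-UD-Q3_K_XL.gguf \
  -p 512 \
  -n 128 \
  -ngl 99 \
  -r 5
# Output reports prompt processing (pp) and token generation (tg) in tok/s,
# so the two OS runs can be compared row by row.
```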
Results:
That left the picture pretty clean:
The 35B I was able to make faster with tuning, but even with that tuning it still did not beat the older 30B. I also rechecked whether llama.cpp defaults were causing the odd Ubuntu result.
Focused sweep on Ubuntu:
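llama-bench accepts comma-separated values and benchmarks every combination, which makes it a natural tool for a sweep like this. The grid below (offload counts and batch sizes) is an illustrative assumption, not the author's:

```bash
# Sweep GPU offload and batch size in one command; llama-bench expands the
# comma-separated lists into a full grid. Filename and values are assumptions.
llama-bench \
  -m Qwen3.5-35B-UD-Q2_K_XL.gguf \
  -ngl 70,80,90,99 \
  -b 256,512,1024 \
  -p 512 \
  -n 128
```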
So for that model:
Model links:
Bottom line:
Key Points
- The post documents practical findings for running local LLMs on an RTX 5060 Ti 16 GB with 32 GB RAM using llama.cpp/llama-server, focusing on which model paths work best rather than raw benchmarks.
- The surprising takeaway is that the strongest real-world picks were not the smallest or heaviest options, with the 30B coder profile and the 35B UD-Q2_K_XL path outperforming alternatives on this hardware.
- The author provides concrete size/quant benchmarks for several models (e.g., 88 tok/s for a 4B model, 76–80 tok/s for 30B UD-Q3_K_XL and 35B UD-Q2_K_XL), illustrating practical tradeoffs across models.
- Practical recommendations are given: default coding model is Unsloth Qwen3-Coder-30B UD-Q3_K_XL; best higher-context coding is Unsloth 30B at 96k; best fast 35B is Unsloth Qwen3.5-35B UD-Q2_K_XL; 35B Q4_K_M is not the right default on this card; and Windows vs Ubuntu results are similar, with only slight differences (a hedged 96k launch sketch follows these points).
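To make the 96k recommendation concrete, here is a hedged sketch of launching the Unsloth 30B at a 96k context on this card. The filename and KV-cache quantization flags are my assumptions for fitting that window into 16 GB, not settings stated in the post:

```bash
# 96k-context sketch; filename and KV-cache flags are assumptions. Quantized
# KV cache requires flash attention, which recent llama.cpp builds enable
# automatically where supported.
llama-server \
  -m Qwen3-Coder-30B-A3B-Instruct-UD-Q3_K_XL.gguf \
  -c 98304 \
  -ngl 99 \
  -ctk q8_0 \
  -ctv q8_0
# 98304 = 96 * 1024 tokens; q8_0 KV cache roughly halves cache VRAM vs f16.
```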




