
RTX 5060 Ti 16GB Local LLM Findings: 30B Still Wins, 35B UD Is Surprisingly Fast

Reddit r/LocalLLaMA / 3/21/2026

💬 Opinion · Tools & Practical Usage

Key Points

  • The post documents practical findings for running local LLMs on an RTX 5060 Ti 16 GB with 32 GB RAM using llama.cpp/llama-server, focusing on which model paths work best rather than raw benchmarks.
  • The surprising takeaway is that the strongest real-world picks were not the smallest or heaviest options, with the 30B coder profile and the 35B UD-Q2_K_XL path outperforming alternatives on this hardware.
  • The author provides concrete size/quant benchmarks for several models (e.g., 88 tok/s for a 4B model, 76–80 tok/s for 30B UD-Q3_K_XL and 35B UD-Q2_K_XL), illustrating practical tradeoffs across models.
  • Practical recommendations are given: default coding model is Unsloth Qwen3-Coder-30B UD-Q3_K_XL; best higher-context coding is Unsloth 30B at 96k; best fast 35B is Unsloth Qwen3.5-35B UD-Q2_K_XL; 35B Q4_K_M is not the right default on this card; Windows vs Ubuntu results are similar but show slight differences.

My first post here, since I benefit a lot from reading. I bought a 5060 Ti 16 GB and tried various models.

This is the short version of how I decided what to run on this card with llama.cpp, not a giant benchmark dump.

Machine:

  • RTX 5060 Ti 16 GB
  • DDR4 now at 32 GB
  • llama-server b8373 (46dba9fce)

Relevant launch settings:

  • fast path: fa=on, ngl=auto, threads=8
  • KV: -ctk q8_0 -ctv q8_0
  • 30B coder path: jinja, reasoning-budget 0, reasoning-format none
  • 35B UD path: c=262144, n-cpu-moe=8
  • 35B Q4_K_M stable tune: -ngl 26 -c 131072 --fit on --fit-ctx 131072 --fit-target 512M
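Putting the fast path and the 30B coder flags together, the launch looks roughly like this (the model filename, context size, and port are placeholders I'm filling in for illustration; the flags are spelled the way current llama-server accepts them):

```shell
# Sketch of the "fast path" + 30B coder launch from the flags above.
# Model path is a placeholder; adjust -c to taste (I ran this model up to 96k).
llama-server \
  -m ./Qwen3-Coder-30B-UD-Q3_K_XL.gguf \
  -fa on -t 8 \
  -ctk q8_0 -ctv q8_0 \
  --jinja --reasoning-budget 0 --reasoning-format none \
  -c 32768 --port 8080
```

I left `-ngl` off here since recent builds pick the GPU layer count automatically, which is what "ngl=auto" above means.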

Short version:

  • Best default coding model: Unsloth Qwen3-Coder-30B UD-Q3_K_XL
  • Best higher-context coding option: the same Unsloth 30B model at 96k
  • Best fast 35B coding option: Unsloth Qwen3.5-35B UD-Q2_K_XL
  • Unsloth Qwen3.5-35B Q4_K_M is interesting, but still not the right default on this card

What surprised me most is that the practical winners here were not just “smaller is faster”. On this machine, the strongest real-world picks were still the 30B coder profile and the older 35B UD-Q2_K_XL path, not the smaller 9B route and not the heavier 35B Q4_K_M experiment.

Quick size / quant snapshot from the local data:

  • Jackrong Qwen 3.5 4B Q5_K_M: 88 tok/s
  • LuffyTheFox Qwen 3.5 9B Q4_K_M: 64 tok/s
  • Jackrong Qwen 3.5 27B Q3_K_S: ~20 tok/s
  • Unsloth Qwen 3.0 30B UD-Q3_K_XL: 76.3 tok/s
  • Unsloth Qwen 3.5 35B UD-Q2_K_XL: 80.1 tok/s

Matched Windows vs Ubuntu shortlist test:

  • same 20 questions
  • same 32k context
  • same max_tokens=800
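For reference, a single matched request looks like this, assuming the OpenAI-compatible endpoint that llama-server exposes (the prompt is just a stand-in for one of the 20 questions, and the port matches the default launch above):

```shell
# One request from the shortlist run against a local llama-server.
# The actual 20 questions aren't included here; this is the request shape only.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
        "max_tokens": 800
      }'
```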

Results:

  • Unsloth Qwen3-Coder-30B UD-Q3_K_XL
    • Windows: 79.5 tok/s, quality 7.94
    • Ubuntu: 76.3 tok/s, quality 8.14
  • Unsloth Qwen3.5-35B UD-Q2_K_XL
    • Windows: 72.3 tok/s, quality 7.40
    • Ubuntu: 80.1 tok/s, quality 7.39
  • Jackrong Qwen3.5-27B Claude-Opus Distilled Q3_K_S
    • Windows: 19.9 tok/s, quality 8.85
    • Ubuntu: ~20.0 tok/s, quality 8.21

That left the picture pretty clean:

  • Unsloth Qwen 3.0 30B is still the safest main recommendation
  • Unsloth Qwen 3.5 35B UD-Q2_K_XL is still the only 35B option here that actually feels fast
  • Jackrong Qwen 3.5 27B stays in the slower quality-first tier

The 35B Q4_K_M result is the main cautionary note.

I was able to make Unsloth Qwen3.5-35B-A3B Q4_K_M stable on this card with:

  • -ngl 26
  • -c 131072
  • -ctk q8_0 -ctv q8_0
  • --fit on --fit-ctx 131072 --fit-target 512M
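Assembled into one command (the model filename is a placeholder; everything else is exactly the tune listed above):

```shell
# The stable 35B Q4_K_M tune on this card, as one invocation.
llama-server \
  -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -ngl 26 -c 131072 \
  -ctk q8_0 -ctv q8_0 \
  --fit on --fit-ctx 131072 --fit-target 512M
```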

But even with that tuning, it still did not beat the older Unsloth UD-Q2_K_XL path in practical use.

I also rechecked whether llama.cpp defaults were causing the odd Ubuntu result on Jackrong 27B. They were not.

Focused sweep on Ubuntu:

  • -fa on, auto parallel: 19.95 tok/s
  • -fa auto, auto parallel: 19.56 tok/s
  • -fa on, --parallel 1: 19.26 tok/s

So for that model:

  • flash-attn on vs auto barely changed anything
  • auto server parallel vs parallel=1 barely changed anything


Bottom line:

  • Unsloth 30B coder is still the best practical recommendation for a 5060 Ti 16 GB
  • Unsloth 30B @ 96k is the upgrade path if you need more context
  • Unsloth 35B UD-Q2_K_XL is still the fast 35B coding option
  • Unsloth 35B Q4_K_M is useful to experiment with, but I would not daily-drive it on this hardware
submitted by /u/Imaginary-Anywhere23