2 days ago there was a very cool post by u/nickl:
https://reddit.com/r/LocalLLaMA/comments/1s7r9wu/
Highly recommend checking it out!
I've run this benchmark on a bunch of local models that can fit into my RTX 5080, some of them partially offloaded to RAM (I have 96GB, but most will fit if you have 64).
Results:
24: unsloth/Qwen3.5-122B-A10B-GGUF:UD-Q4_K_XL 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟥🟩 🟩🟩🟩🟩🟩
23: bartowski/Qwen_Qwen3.5-27B-GGUF:IQ4_XS 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟥🟩 🟥🟩🟩🟩🟩
23: unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟥🟩 🟥🟩🟩🟩🟩
22: unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q6_K_XL 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟥🟩🟩 🟩🟩🟩🟥🟩 🟥🟩🟩🟩🟩
22: mradermacher/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-i1-GGUF:Q3_K_M 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟥🟩🟥🟩 🟥🟩🟩🟩🟩
22: Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF:Q4_K_M 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟥🟥🟩 🟥🟩🟩🟩🟩
21: unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_S 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟨🟥 🟥🟨🟩🟩🟩
20: unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL 🟩🟩🟩🟩🟨 🟩🟩🟩🟩🟩 🟩🟩🟨🟩🟩 🟩🟩🟩🟥🟨 🟥🟩🟩🟩🟩
20: mradermacher/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-i1-GGUF:Q6_K 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟥🟩🟩 🟥🟩🟩🟥🟩 🟥🟥🟩🟩🟩
19: unsloth/GLM-4.7-Flash-GGUF:UD-Q6_K_XL 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟥🟩🟩 🟩🟩🟩🟥🟨 🟥🟨🟩🟥🟩
18: unsloth/GLM-4.5-Air-GGUF:Q5_K_M 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟥🟩🟩 🟥🟩🟩🟥🟩 🟨🟨🟥🟩🟨
18: bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF:Q6_K_L 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟨🟩🟩 🟩🟩🟩🟥🟩 🟨🟨🟥🟨🟨
17: Jackrong/Qwopus3.5-9B-v3-GGUF:Q8_0 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟥🟥🟩🟩 🟥🟩🟥🟥🟥 🟥🟩🟩🟩🟨
16: unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL 🟩🟩🟩🟩🟨 🟩🟩🟩🟩🟩 🟩🟩🟨🟩🟩 🟥🟨🟩🟥🟨 🟥🟨🟩🟨🟩
16: byteshape/Devstral-Small-2-24B-Instruct-2512-GGUF:IQ3_S 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟥🟩🟨🟩🟩 🟩🟩🟨🟥🟨 🟨🟨🟥🟨🟩
16: mradermacher/Qwen3.5-9B-Claude-4.6-HighIQ-THINKING-i1-GGUF:Q6_K 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟨🟥🟩 🟥🟩🟥🟥🟨 🟥🟩🟥🟩🟨
14: mradermacher/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT-i1-GGUF:Q6_K 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟥🟩🟥🟩🟩 🟩🟨🟥🟥🟨 🟨🟨🟥🟨🟨
14: unsloth/GLM-4.6V-GGUF:Q3_K_S 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟥🟩🟨🟨🟩 🟥🟩🟩🟨🟨 🟨🟨🟨🟨🟨
5: bartowski/Tesslate_OmniCoder-9B-GGUF:Q6_K_L 🟨🟨🟨🟨🟨 🟨🟨🟨🟩🟩 🟩🟨🟨🟩🟨 🟨🟨🟩🟨🟨 🟨🟨🟨🟨🟨
5: unsloth/Qwen3.5-9B-GGUF:UD-Q6_K_XL 🟨🟨🟨🟨🟨 🟨🟨🟨🟩🟩 🟨🟩🟨🟨🟩 🟨🟩🟨🟨🟨 🟨🟨🟨🟨🟨

The biggest surprise, to be honest, is Qwen3.5-9B-Claude-4.6-HighIQ-THINKING: it goes from 5 green tests with the base Qwen3.5-9B to 16. Most of the base model's errors boiled down to being unable to call the tools with correct formatting. For how small it is, it's a very reliable finetune.
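For anyone eyeballing the grid: the score on the left is just the number of green cells out of 25 (e.g. "5 green tests" matches the base Qwen3.5-9B row). A quick sketch to recount a row yourself, assuming you've pasted the emoji row into a shell variable:

```shell
# Count passed (green) tests in one 25-cell result row.
# grep -o prints each 🟩 match on its own line; wc -l counts them.
row="🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟥🟩 🟩🟩🟩🟩🟩"
printf '%s' "$row" | grep -o "🟩" | wc -l   # → 24
```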
Qwen3.5-122B-A10B is still king on 16GB GPUs because I can offload the experts to RAM. Speed isn't perfect, but the quality is great and I can still fit a sizable context into VRAM. Q4_K_XL uses around 68GB of RAM and IQ3_XXS around 33GB, so the smaller quant works with 64GB of system RAM.
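If you want to try the expert-offload setup, here's a rough llama.cpp launch sketch. This is an assumption-laden example, not my exact command: it assumes a recent llama-server build with `-hf` download support and the `--override-tensor` (`-ot`) flag, and that the MoE expert tensors follow the usual `ffn_*_exps` naming; the context size is just a placeholder.

```shell
# Sketch only — flag availability and tensor names depend on your
# llama.cpp build and the specific GGUF. The -ot regex pins the MoE
# expert weights to system RAM while the rest stays in VRAM.
llama-server \
  -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
  --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 32768
```

Newer llama.cpp builds also have a `--n-cpu-moe N` convenience flag that does roughly the same thing without the regex.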
Note, though, that these benchmarks mostly test a fairly isolated SQL call. That makes them a nice quick way to compare two models, tool calling included, but not representative of understanding a larger codebase's context, where bigger models will pull ahead.
Edit: added a 9B Qwopus model