I think the 26B-A4B MoE model is the best fit for 16 GB. I tested many quantizations, but if you want to keep vision support, I think the best one currently is:
https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/blob/main/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf
(I tested bartowski variants too, but unsloth has better reasoning for the size)
But you need some parameter tweaking for the best performance, especially for coding:
--temp 0.3 --top-p 0.9 --min-p 0.1 --top-k 20
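Putting those sampler flags into a full launch command, it might look like this (the model filename matches the link above; context size and GPU offload values are just illustrative, tune them for your card):

```shell
# Illustrative llama-server launch with the sampler settings above;
# -c (context) and -ngl (GPU layers) are placeholder values, adjust to taste
llama-server \
  -m gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  --temp 0.3 --top-p 0.9 --min-p 0.1 --top-k 20 \
  -c 32768 -ngl 99
```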
With temp and top-k kept low and min-p raised a little, it performs very well. No issues so far, and it comes very close to the AI Studio-hosted model.
For vision, use the mmproj-F16.gguf; FP32 gives no benefit at all. And, very importantly:
--image-min-tokens 300 --image-max-tokens 1024
Using a minimum of 300 tokens per image improves vision performance a lot.
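For reference, the vision pieces would be passed like this (filenames follow the links above; this is a sketch, not a verified command line):

```shell
# Sketch: attach the vision projector and set the image token bounds
llama-server \
  -m gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  --mmproj mmproj-F16.gguf \
  --image-min-tokens 300 --image-max-tokens 1024
```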
With this setup I can fit 30K+ tokens of fp16 KV cache with np -1. If you need more, I think it is better to drop vision than to go to KV Q8, since quantizing the cache makes output noticeably worse.
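As a rough sanity check on the KV budget: an fp16 cache stores, per token, K and V vectors for every layer, so size scales as 2 × layers × KV heads × head dim × 2 bytes. A quick sketch (the layer/head numbers below are hypothetical placeholders, not this model's actual config):

```python
# Rough fp16 KV-cache size estimate; the dimensions used in the example
# are hypothetical placeholders, not this model's real architecture.
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each hold n_layers * n_kv_heads * head_dim values per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

# Example: 32 layers, 4 KV heads, head_dim 128, 32K context
gib = kv_cache_bytes(32768, 32, 4, 128) / 2**30
print(f"{gib:.2f} GiB")  # → 2.00 GiB at fp16 for these placeholder dims
```

Going to Q8 KV roughly halves that number, which is why it's tempting at long context, but in my experience the quality hit isn't worth it here.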
With this setup, I feel this model is an absolute beast for 16 GB VRAM.
Make sure to use the latest llama.cpp builds, or if you are using a UI wrapper, update its runtime. (At the moment llama.cpp has another tokenizer issue in builds after b8660; stick with b8660 for now. That build has a tool-call issue, but for chatting it works fine.)
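If you build from source, you can pin that release tag directly; a minimal sketch (the CUDA flag assumes an NVIDIA card, swap in your backend):

```shell
# Pin llama.cpp to release tag b8660 and build it
# (prebuilt binaries for the same tag are also on the GitHub releases page)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout b8660
cmake -B build -DGGML_CUDA=ON   # assumes CUDA; use your backend's flag
cmake --build build -j
```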
In my testing compared to my previous daily driver (Qwen 3.5 27B):
- runs at 80+ tps vs 20 tps
- with --image-min-tokens 300, its vision is on par with or better than the Qwen 3 27B variant I run locally
- it has much better multilingual support
- it is superior for systems & DevOps work
- for real-world coding that needs up-to-date libraries, it is much better, because Qwen more often reaches for outdated modules
- for long context, Qwen is still slightly better, but that is expected since this is an MoE