Gemma 4 26b A3B is mind-blowingly good, if configured right

Reddit r/LocalLLaMA / 4/7/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author tests multiple local LLMs and quants on an RTX 3090 using LM Studio and reports persistent tool-calling glitches such as infinite loops on several models.
  • They find Gemma 4 26b (A3B) performs reliably for tool calling when run via Ollama/llama.cpp, with prompt caching working “flawlessly” and stable fast generation speeds.
  • Using Flash Attention and Q4_0 KV-cache quantization, they report pushing context length up to 260k tokens on the 3090 while maintaining model performance.
  • The author shares specific inference settings (unsloth q3k_m quant, temperature 1, top-k 40) and a custom system prompt they believe improve function/tool-calling outcomes.
  • They conclude the model is especially strong for agentic coding/workflows and search-based plugins, but note high VRAM requirements for tool calling/agent use.

For the last few days I've been trying different models and quants on my RTX 3090 in LM Studio, but every single one glitches out on tool calling: an infinite loop that doesn't stop. But I really liked the model because it is really fast, like 80–110 tokens a second, and even at high context it still maintains very high speeds.

I had great success with tool calling in the Qwen3.5 MoE model, but the issue I had with Qwen models is that some kind of bug in Win11 and LM Studio makes prompt caching not work, so when the conversation hits 30–40k context it is so slow at processing prompts that it just kills my will to work with it.

Gemma 4 is different: it is much better supported in llama.cpp (which Ollama runs on), and the prompt caching works flawlessly. I'm using Flash Attention plus Q4 KV-cache quants, and with this I can push it to literally the maximum of 260k context on the RTX 3090, and the model performs just as well.
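As a sketch, a llama.cpp server launch matching this setup might look like the following. The GGUF filename is a placeholder (the author doesn't give one), and the exact flag spelling varies between llama.cpp builds (older builds take a bare `--flash-attn` with no value):

```shell
# Hypothetical llama-server launch: Flash Attention on, KV cache
# quantized to Q4_0 on both K and V, and the full ~260k context window.
llama-server \
  --model gemma-4-26b-a3b-Q3_K_M.gguf \  # placeholder filename
  --ctx-size 262144 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --n-gpu-layers 99
```

Quantizing the KV cache (`--cache-type-k`/`--cache-type-v`) is what makes such a long context fit in 24 GB; without it, the cache alone would dominate VRAM at this length.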

I finally found the one that works for me: the Unsloth q3k_m quant, temperature 1, and top-k sampling at 40. I have a custom system prompt that I'm using, which also might be helping.
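If you run it through Ollama, these sampling settings can be pinned in a Modelfile so every session uses them. The model tag and the system prompt below are placeholders, since the author didn't share theirs:

```
# Hypothetical Modelfile; the FROM tag and SYSTEM text are stand-ins.
FROM gemma-4-26b-a3b:q3_k_m
PARAMETER temperature 1
PARAMETER top_k 40
PARAMETER num_ctx 262144
SYSTEM """(your custom tool-calling system prompt goes here)"""
```

Build it with `ollama create my-gemma -f Modelfile` and the parameters travel with the model instead of having to be re-set per client.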

I've been testing it with opencode for the last 6 hours and I just can't stop; it cannot fail. It explained the whole structure of opencode itself to me, and it is huge: the whole repo is 2.7GB, so many lines of code, and it has no issues traversing around and reading everything, explaining how certain things work. I think I'm going to create my own version of opencode in the end.

It honestly feels like Claude Sonnet level of quality; it never fails at function calling. I think this might be the best model for agentic coding, tool calling, open claw, or search-engine use.
I prefer it over Perplexity: LM Studio connected to a search engine via a plugin delivers much better results than Perplexity or Google.

As for VRAM consumption, it is heavy. It could probably work on 16GB if not for tool calling or agents: you need 10–15k context just to get those started. My GPU has 24GB, so it can run at full context with no issues on Q4_0 KV.
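A rough back-of-the-envelope shows why the Q4_0 KV cache is what makes 260k context fit. The architecture numbers below (layer count, KV heads, head dimension) are made-up placeholders, not Gemma 4's real config; the 4.5 bits/element figure comes from Q4_0's layout (32 four-bit values plus one fp16 scale per block):

```python
def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, bits_per_elem):
    """Estimate KV cache size: two tensors (K and V) per layer."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx
    return elems * bits_per_elem / 8 / 2**30

# Placeholder architecture (NOT the real Gemma 4 26B config):
# 48 layers, 4 KV heads, head_dim 128, full 262,144-token context.
args = dict(ctx=262_144, n_layers=48, n_kv_heads=4, head_dim=128)

fp16 = kv_cache_gib(**args, bits_per_elem=16)   # ~24 GiB at f16
q4 = kv_cache_gib(**args, bits_per_elem=4.5)    # ~6.8 GiB at Q4_0
print(f"f16: {fp16:.1f} GiB, Q4_0: {q4:.1f} GiB")
```

Under these assumed numbers, an fp16 KV cache at full context would eat the entire 24 GB card by itself, while Q4_0 leaves most of the VRAM for the model weights.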

submitted by /u/cviperr33