Gemma 4 26b A3B is mind-blowingly good, if configured right

Reddit r/LocalLLaMA / 4/7/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author tests multiple local LLMs and quants on an RTX 3090 using LM Studio and reports persistent tool-calling glitches such as infinite loops on several models.
  • They find Gemma 4 26b (A3B) performs reliably for tool calling when run via Ollama/llama.cpp, with prompt caching working “flawlessly” and stable fast generation speeds.
  • Using Flash Attention and Q4_0 KV-cache quantization, they report pushing context length up to 260k tokens on the 3090 while maintaining model performance.
  • The author shares specific inference settings (unsloth q3k_m quant, temperature 1, top-k 40) and a custom system prompt they believe improve function/tool-calling outcomes.
  • They conclude the model is especially strong for agentic coding/workflows and search-based plugins, but note high VRAM requirements for tool calling/agent use.

For the last few days I've been trying different models and quants on my RTX 3090 in LM Studio, but every single one glitches out on tool calling: an infinite loop that doesn't stop. But I really liked the model because it is really fast, like 80–110 tokens a second, and even at high context it still maintains very high speeds.

I had great success with tool calling in the Qwen3.5 MoE model, but the issue I had with Qwen models is that some kind of bug in Win11 and LM Studio makes prompt caching not work, so when the conversation hits 30–40k context it is so slow at processing prompts that it just kills my will to work with it.

Gemma 4 is different: it is much better supported in llama.cpp (which Ollama runs on), and the prompt caching works flawlessly. I'm using Flash Attention plus Q4 KV-cache quants, and with this I can push it to literally the maximum of 260k context on the RTX 3090, and the model performs just as well.
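As a sketch, a llama.cpp server launch matching this setup might look like the following. The GGUF filename is a placeholder (the author doesn't give one), and the exact flag spelling varies between llama.cpp builds (older builds take a bare `--flash-attn` with no value):

```shell
# Hypothetical llama-server launch: Flash Attention on, KV cache
# quantized to Q4_0 on both K and V, and the full ~260k context window.
llama-server \
  --model gemma-4-26b-a3b-Q3_K_M.gguf \  # placeholder filename
  --ctx-size 262144 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --n-gpu-layers 99
```

Quantizing the KV cache (`--cache-type-k`/`--cache-type-v`) is what makes such a long context fit in 24 GB; without it, the cache alone would dominate VRAM at this length.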

I finally found the one that works for me: the Unsloth q3k_m quant, temperature 1, and top-k sampling at 40. I have a custom system prompt that I'm using, which also might be helping.
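If you run it through Ollama, these sampling settings can be pinned in a Modelfile so every session uses them. The model tag and the system prompt below are placeholders, since the author didn't share theirs:

```
# Hypothetical Modelfile; the FROM tag and SYSTEM text are stand-ins.
FROM gemma-4-26b-a3b:q3_k_m
PARAMETER temperature 1
PARAMETER top_k 40
PARAMETER num_ctx 262144
SYSTEM """(your custom tool-calling system prompt goes here)"""
```

Build it with `ollama create my-gemma -f Modelfile` and the parameters travel with the model instead of having to be re-set per client.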

I've been testing it with opencode for the last 6 hours and I just can't stop; it cannot fail. It explained the whole structure of opencode itself to me, and it is huge: the whole repo is 2.7GB, so many lines of code, and it has no issues traversing around and reading everything, explaining how certain things work. I think I'm going to create my own version of opencode in the end.

It honestly feels like Claude Sonnet level of quality; it never fails at function calling. I think this might be the best model for agentic coding, tool calling, open claw, or search-engine use.
I prefer it over Perplexity: LM Studio connected to a search engine via a plugin delivers much better results than Perplexity or Google.

As for VRAM consumption, it is heavy. It could probably work on 16GB if not for tool calling or agents: you need 10–15k context just to get those started. My GPU has 24GB, so it can run at full context with no issues on Q4_0 KV.
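A rough back-of-the-envelope shows why the Q4_0 KV cache is what makes 260k context fit. The architecture numbers below (layer count, KV heads, head dimension) are made-up placeholders, not Gemma 4's real config; the 4.5 bits/element figure comes from Q4_0's layout (32 four-bit values plus one fp16 scale per block):

```python
def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, bits_per_elem):
    """Estimate KV cache size: two tensors (K and V) per layer."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx
    return elems * bits_per_elem / 8 / 2**30

# Placeholder architecture (NOT the real Gemma 4 26B config):
# 48 layers, 4 KV heads, head_dim 128, full 262,144-token context.
args = dict(ctx=262_144, n_layers=48, n_kv_heads=4, head_dim=128)

fp16 = kv_cache_gib(**args, bits_per_elem=16)   # ~24 GiB at f16
q4 = kv_cache_gib(**args, bits_per_elem=4.5)    # ~6.8 GiB at Q4_0
print(f"f16: {fp16:.1f} GiB, Q4_0: {q4:.1f} GiB")
```

Under these assumed numbers, an fp16 KV cache at full context would eat the entire 24 GB card by itself, while Q4_0 leaves most of the VRAM for the model weights.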

submitted by /u/cviperr33