| So this might very well be user error on my end, but please let me know if what I am doing is somehow wrong:
Consistently I see the LLM not doing what it is told to do. For example - I have some here:
So this has been my experience - let me know if I'm doing anything obviously wrong, or whether this is a case where I simply have to tone down my expectations. I know I can't have SOTA-like expectations for a model of this size, but I don't know if I'm miscalibrated or not. Because of all the hype around this Gemma 4 release, I thought it would be something that could call tools reliably, versus my experience with some older models (GPT-OSS 20B / Qwen 3 Next / Qwen 3 Coder - the GPT 20B version used to say "I'll call the tool" and then just stop; the Qwen models were better). So I'm not sure whether this is a calibration problem, whether I don't have a proper system prompt that works well with this model on OpenCode, or whether some of my settings are wrong. |
Gemma 4 26B on oMLX with OpenCode, M4 Max, 64GB unified - am I doing something wrong/miscalibrated on capabilities here?
Reddit r/LocalLLaMA / 4/13/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- A Reddit user asks whether they are misconfiguring or misunderstanding capabilities when running Gemma 4 26B (4-bit, 200k context) on oMLX 0.3.5dev1 using an OpenCode harness on an M4 Max (64GB unified memory) setup.
- They report that the model sometimes fails to “think” despite a high thinkingBudget setting and may stop after announcing a tool call without executing it, raising questions about the reasoning parser and chat-template/tool-call handling.
- They observe slower token generation than other users who run on similar hardware, and they suspect the much larger 200k context might be the main cause.
- They also see repetition loops with default repetition penalty and wonder whether this behavior has been improved or patched in later oMLX versions.
- The discussion is essentially a troubleshooting thread seeking guidance on correct oMLX/OpenCode configuration (e.g., the reasoning_parser choice and related runtime parameters).
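One way to separate "the model won't call tools" from "the harness drops the tool call" is to bypass OpenCode and send a minimal tool-calling request directly to the server. The sketch below builds an OpenAI-compatible `/v1/chat/completions` payload; the port, model id, and the assumption that oMLX exposes this endpoint are all unverified placeholders, not confirmed details from the thread.

```python
import json

# Assumed local endpoint and model id -- placeholders, adjust to your setup.
BASE_URL = "http://localhost:8080/v1/chat/completions"
MODEL_ID = "gemma-4-26b-4bit"

def build_tool_call_probe():
    """Build a request body whose only sensible answer is a tool call.

    If the raw server response contains a populated `tool_calls` field but
    OpenCode still stalls after "I'll call the tool", the problem is likely
    in the harness's chat-template/tool-call parsing, not the model.
    """
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "user",
             "content": "What is the weather in Berlin right now?"}
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }

if __name__ == "__main__":
    # Send this body with any HTTP client, e.g.:
    #   curl $BASE_URL -H "Content-Type: application/json" -d @payload.json
    print(json.dumps(build_tool_call_probe(), indent=2))
```

If the model id or endpoint differs on your install, only the two constants at the top need changing; the payload shape is the standard OpenAI tools schema.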
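On the repetition-loop point: most local servers implement the CTRL-style repetition penalty, which shrinks the logits of tokens already generated. A minimal sketch of that transform (not oMLX's actual code, which is not shown in the thread) makes it clear why a penalty of 1.0 is a no-op and why values much above ~1.2 can degrade output:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """CTRL-style repetition penalty over raw logits.

    Positive logits of already-seen tokens are divided by the penalty and
    negative ones multiplied by it, so repeated tokens become less likely
    in either case. penalty == 1.0 leaves the distribution unchanged.
    """
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out
```

With `penalty=2.0` a seen token's logit of `2.0` drops to `1.0` and `-1.0` drops to `-2.0`, while unseen tokens are untouched; a loop that persists even with a sensible penalty therefore points at sampling settings or a template issue rather than the penalty itself.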


