Gemma 4 26B on oMLX with OpenCode, M4 Max, 64GB unified - am I doing something wrong/miscalibrated on capabilities here?

Reddit r/LocalLLaMA / 4/13/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A Reddit user asks whether they are misconfiguring or misunderstanding capabilities when running Gemma 4 26B (4-bit, 200k context) on oMLX 0.3.5dev1 using an OpenCode harness on an M4 Max (64GB unified memory) setup.
  • They report that the model sometimes fails to “think” despite a high thinkingBudget setting and may stop after announcing a tool call without executing it, raising questions about the reasoning parser and chat-template/tool-call handling.
  • They observe slower token generation than other users who run on similar hardware, and they suspect the much larger 200k context might be the main cause.
  • They also see repetition loops with default repetition penalty and wonder whether this behavior has been improved or patched in later oMLX versions.
  • The discussion is essentially a troubleshooting thread seeking guidance on correct oMLX/OpenCode configuration (e.g., the reasoning_parser choice and relevant runtime parameters).

https://preview.redd.it/u5y6j3a1etug1.png?width=1668&format=png&auto=webp&s=5a1cefb7cbe71522fa9f9ce599ae09969ce90629

https://preview.redd.it/7j92jhc3etug1.png?width=682&format=png&auto=webp&s=e1edbc7c589359ab75abaab08cfe7a208789a0bc

So this might very well be user error on my end, but please let me know if anything in my setup is wrong:

  • M4 Max (highest core count version), 64GB of unified memory
  • Serving with oMLX 0.3.5dev1; Gemma 4-bit it 26-a4b (200k context)
  • OpenCode harness for running the model - no custom instructions for now

I consistently see the LLM not doing what it is told to do. Some examples:

  • It doesn't think all the time. I have it on the "high" variant in OpenCode, which sets thinkingBudget to 8092 tokens, and I have "forced" it within oMLX via the chat template and thinking budget - but it does not always think. It also sometimes stops after saying it will make a certain tool call without actually making it. I don't know whether this is a result of the qwen reasoning parser I'm using. If anyone is using oMLX, let me know which reasoning_parser you use.
  • Another random question: I see a lot of people running this on the same hardware as mine with much higher token generation speeds, but they are using a smaller context (I'm at 200k). Is that the reason, or am I doing something else wrong?
  • It goes into repetition loops. I'm using the default repetition penalty, but sometimes it's just bad (this was with oMLX v0.3.3, so it may have been patched since). Screenshot for this is also attached:
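On the context-length question above: a 200k-token window doesn't just slow prefill, it also means a much larger KV cache competing with the weights for the 64GB of unified memory, which drags down generation speed. A rough back-of-the-envelope sketch, where the layer count, KV-head count, and head dim are placeholder guesses (NOT the real Gemma 4 26B config), and which ignores any sliding-window/interleaved-attention savings the architecture may have:

```python
# Rough KV-cache sizing sketch: why a 200k context hurts on 64GB unified memory.
# Model dimensions are illustrative assumptions, not the real Gemma 4 config.

def kv_cache_bytes(context_len, n_layers=48, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes for the K and V caches across all layers at a given context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return context_len * per_token

for ctx in (8_192, 32_768, 200_000):
    gb = kv_cache_bytes(ctx) / 1024**3
    print(f"{ctx:>7} tokens -> ~{gb:.1f} GB KV cache")
# ~1.5 GB at 8k, ~6.0 GB at 32k, ~36.6 GB at 200k (with these assumed dims)
```

With numbers in this ballpark, a fully used 200k cache plus ~15GB of 4-bit weights would be pushing the machine's memory budget, so slower tokens/sec than people running 32k contexts on the same hardware would be expected rather than a misconfiguration.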

https://preview.redd.it/9eu29tuiftug1.png?width=1996&format=png&auto=webp&s=5c3b6d85be35fb8c087c878b3add29377d5ce048

(Filenames are redacted - I asked Opus to replay the Gemma 4 conversation without any of the sensitive filenames and shit lol)

So this has been my experience - let me know if I'm doing anything obviously wrong, or whether this is a case where I just have to tone down my expectations. I know I can't have SOTA-like expectations for a model of this size, but I don't know if I'm miscalibrated. Given all the hype around the Gemma 4 release, I thought it would be able to call tools reliably, unlike my experience with some older models (GPT-OSS 20B / Qwen 3 Next / Qwen 3 Coder - the GPT-OSS 20B used to do the same "I'll call the tool" and then just stop; the Qwen models were better).

So I'm not sure whether this is a calibration problem, a missing system prompt that works well with this model in OpenCode, or some setting I have wrong.
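One way to narrow down whether the repetition loops and dropped tool calls come from the server or from the OpenCode harness is to hit the server directly with a bare request. A minimal sketch, assuming oMLX exposes an OpenAI-compatible `/v1/chat/completions` endpoint (common for MLX-based servers) - the port, model id, and the `repetition_penalty` field name are all assumptions to check against your build (some servers call it `repeat_penalty`):

```python
# Probe a local OpenAI-compatible chat endpoint directly, bypassing OpenCode,
# to see whether repetition loops reproduce with an explicit penalty set.
import json
import urllib.request

def build_probe_payload(prompt, repetition_penalty=1.1):
    """Chat-completions payload with an explicit repetition penalty.

    Model id and the penalty field name are assumptions - adjust for your server.
    """
    return {
        "model": "gemma-4-26b-it-4bit",   # assumed model id
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
        "temperature": 0.7,
        "repetition_penalty": repetition_penalty,
    }

def send_probe(payload, url="http://localhost:8080/v1/chat/completions"):
    """POST the payload and return the first choice's text (assumed schema/port)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

If a loop-prone prompt behaves fine here but loops through OpenCode, the harness's template or parameters are the place to look; if it loops either way, it's the server-side sampling or the model itself.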

submitted by /u/DarthLoki79