Gemma 4 26B on oMLX with OpenCode, M4 Max, 64GB unified - am I doing something wrong/miscalibrated on capabilities here?

Reddit r/LocalLLaMA / 4/13/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A Reddit user asks whether they are misconfiguring or misunderstanding capabilities when running Gemma 4 26B (4-bit, 200k context) on oMLX 0.3.5dev1 using an OpenCode harness on an M4 Max (64GB unified memory) setup.
  • They report that the model sometimes fails to “think” despite a high thinkingBudget setting and may stop after announcing a tool call without executing it, raising questions about the reasoning parser and chat-template/tool-call handling.
  • They observe slower token generation than other users who run on similar hardware, and they suspect the much larger 200k context might be the main cause.
  • They also see repetition loops with default repetition penalty and wonder whether this behavior has been improved or patched in later oMLX versions.
  • The discussion is essentially a troubleshooting thread seeking guidance on correct oMLX/OpenCode configuration (e.g., the reasoning_parser choice and relevant runtime parameters).

https://preview.redd.it/u5y6j3a1etug1.png?width=1668&format=png&auto=webp&s=5a1cefb7cbe71522fa9f9ce599ae09969ce90629

https://preview.redd.it/7j92jhc3etug1.png?width=682&format=png&auto=webp&s=e1edbc7c589359ab75abaab08cfe7a208789a0bc

So this might very well be user error on my end, but please let me know if anything in my setup is wrong:

  • M4 Max (highest core count version), 64GB of unified memory
  • Serving with oMLX 0.3.5dev1; Gemma 4-bit it 26-a4b (200k context)
  • OpenCode harness for running the model - no custom instructions for now

I consistently see the LLM not doing what it is told to do. Some examples:

  • It doesn't think all the time. I have it on the "high" variant in OpenCode, which sets thinkingBudget to 8092 tokens, and I have "forced" it within oMLX via the chat template and thinking budget - but it does not always think. It also sometimes stops after saying it will make a certain tool call without actually making it. I don't know whether this is a result of the qwen reasoning parser I'm using. If anyone is using oMLX, let me know which reasoning_parser you use.
  • Another random question: I see a lot of people running this on the same hardware as mine with much higher token generation speeds, but they are using a smaller context (I'm at 200k). Is that the reason, or am I doing something else wrong?
  • It goes into repetition loops. I'm using the default repetition penalty, but sometimes it's just bad (this was with oMLX v0.3.3, so it may have been patched since). Screenshot for this is also attached:
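On the context-length question above: a 200k-token window doesn't just slow prefill, it also means a much larger KV cache competing with the weights for the 64GB of unified memory, which drags down generation speed. A rough back-of-the-envelope sketch, where the layer count, KV-head count, and head dim are placeholder guesses (NOT the real Gemma 4 26B config), and which ignores any sliding-window/interleaved-attention savings the architecture may have:

```python
# Rough KV-cache sizing sketch: why a 200k context hurts on 64GB unified memory.
# Model dimensions are illustrative assumptions, not the real Gemma 4 config.

def kv_cache_bytes(context_len, n_layers=48, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes for the K and V caches across all layers at a given context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return context_len * per_token

for ctx in (8_192, 32_768, 200_000):
    gb = kv_cache_bytes(ctx) / 1024**3
    print(f"{ctx:>7} tokens -> ~{gb:.1f} GB KV cache")
# ~1.5 GB at 8k, ~6.0 GB at 32k, ~36.6 GB at 200k (with these assumed dims)
```

With numbers in this ballpark, a fully used 200k cache plus ~15GB of 4-bit weights would be pushing the machine's memory budget, so slower tokens/sec than people running 32k contexts on the same hardware would be expected rather than a misconfiguration.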

https://preview.redd.it/9eu29tuiftug1.png?width=1996&format=png&auto=webp&s=5c3b6d85be35fb8c087c878b3add29377d5ce048

(Filenames are redacted - I asked Opus to replay the Gemma 4 conversation without any of the sensitive filenames and shit lol)

So this has been my experience - let me know if I'm doing anything obviously wrong, or whether this is a case where I just have to tone down my expectations. I know I can't have SOTA-like expectations for a model of this size, but I don't know if I'm miscalibrated. Given all the hype around the Gemma 4 release, I thought it would be able to call tools reliably, unlike my experience with some older models (GPT-OSS 20B / Qwen 3 Next / Qwen 3 Coder - the GPT-OSS 20B used to do the same "I'll call the tool" and then just stop; the Qwen models were better).

So I'm not sure whether this is a calibration problem, a missing system prompt that works well with this model in OpenCode, or some setting I have wrong.
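One way to narrow down whether the repetition loops and dropped tool calls come from the server or from the OpenCode harness is to hit the server directly with a bare request. A minimal sketch, assuming oMLX exposes an OpenAI-compatible `/v1/chat/completions` endpoint (common for MLX-based servers) - the port, model id, and the `repetition_penalty` field name are all assumptions to check against your build (some servers call it `repeat_penalty`):

```python
# Probe a local OpenAI-compatible chat endpoint directly, bypassing OpenCode,
# to see whether repetition loops reproduce with an explicit penalty set.
import json
import urllib.request

def build_probe_payload(prompt, repetition_penalty=1.1):
    """Chat-completions payload with an explicit repetition penalty.

    Model id and the penalty field name are assumptions - adjust for your server.
    """
    return {
        "model": "gemma-4-26b-it-4bit",   # assumed model id
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
        "temperature": 0.7,
        "repetition_penalty": repetition_penalty,
    }

def send_probe(payload, url="http://localhost:8080/v1/chat/completions"):
    """POST the payload and return the first choice's text (assumed schema/port)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

If a loop-prone prompt behaves fine here but loops through OpenCode, the harness's template or parameters are the place to look; if it loops either way, it's the server-side sampling or the model itself.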

submitted by /u/DarthLoki79