Gemma4 26B A4B runs easily on 16GB Macs

Reddit r/LocalLLaMA / 4/5/2026

Key Points

The post argues that running Gemma4 26B A4B on 16GB Macs is generally hard with GPU acceleration unless the quantization is very aggressive, but becomes feasible with CPU-only execution, which suits MoE models and allows good 4–5 bit quants.
It reports practical performance results on an M2 MacBook Pro, achieving roughly 6–10 tokens per second with 8–16K context by testing several 4–5 bit quants (noting Unsloth’s IQ4_NL as best).
Suggested setup steps include setting GPU layers to 0, disabling “keep model in memory,” and using a modest batch size, with optional KV cache quantization.
For LM Studio users, it provides a workaround to enable the model’s “thinking” feature by editing the Jinja prompt template and adjusting the reasoning parsing start/end strings.
The overall takeaway is that, while not fast, the configuration makes the model usable for local, consumer-hardware workflows by running on the CPU, letting experts swap in and out of memory, and choosing quants carefully.

Typically, models in the 26B class are difficult to run on 16GB Macs because any GPU acceleration requires the accelerated layers to sit entirely within wired memory. It's possible with aggressive quants (2 bits, or maybe a very lightweight IQ3_XXS), but quality degrades significantly at that level.

However, if the model is run entirely on the CPU instead (which is much more feasible with MoE models), it's possible to run really good quants even when the model ends up being larger than the entire available system RAM. There is some performance loss from swapping experts in and out, but I find it is much smaller than I would have expected.
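To make the memory argument concrete, here is a back-of-the-envelope sketch. The parameter counts are read off the "26B A4B" name, and 4.5 bits per weight is a nominal figure for a 4-bit quant such as IQ4_NL; both are assumptions for illustration, and real GGUF files differ because some tensors are stored at other precisions.

```python
# Rough size estimate for a quantized MoE model (illustrative numbers only).

GIB = 1024 ** 3

def quantized_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB, ignoring metadata and mixed-precision tensors."""
    return n_params * bits_per_weight / 8 / GIB

# Assumptions: total and active parameter counts taken from the model name,
# and a nominal bits-per-weight value for a 4-bit quant.
TOTAL_PARAMS = 26e9
ACTIVE_PARAMS = 4e9
BITS_PER_WEIGHT = 4.5

total_gib = quantized_size_gib(TOTAL_PARAMS, BITS_PER_WEIGHT)
active_gib = quantized_size_gib(ACTIVE_PARAMS, BITS_PER_WEIGHT)

print(f"all weights:      ~{total_gib:.1f} GiB")
print(f"active per token: ~{active_gib:.1f} GiB")
```

Under these assumptions the full set of weights is close to the machine's total RAM once the OS and the context are counted, while the weights touched for any single token are a small fraction of that. Different tokens route to different experts, so some paging still happens, which is the performance loss described above.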

I was able to easily achieve 6-10 tokens per second with a context window of 8-16K on my M2 MacBook Pro (tested using various 4- and 5-bit quants; Unsloth's IQ4_NL works best). Far from fast, but good enough to be perfectly usable for folks used to running on this kind of hardware.

Just set the number of GPU layers to 0, uncheck "keep model in memory", and set the batch size to 64 or a similarly small value. Everything else can be left at the default (KV cache quantization is optional, but Q8_0 might improve performance a little bit).
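For anyone running llama.cpp directly rather than through the LM Studio UI, the settings above could be sketched as a command line like the one this snippet builds. The mapping is my assumption, not something from the original setup: "keep model in memory" is treated as llama.cpp's `--mlock` (so it is simply left off), the helper name and the model filename are hypothetical, and the flags should be checked against `llama-cli --help` for your build. Some builds also need flash attention enabled before the V cache can be quantized.

```python
from typing import List, Optional

def build_llama_args(
    model_path: str,
    ctx_size: int = 8192,
    batch_size: int = 64,
    kv_cache_type: Optional[str] = None,
) -> List[str]:
    """Build a CPU-only llama.cpp argument list (hypothetical helper)."""
    args = [
        "-m", model_path,
        "-ngl", "0",            # offload no layers to the GPU
        "-c", str(ctx_size),    # context window in tokens
        "-b", str(batch_size),  # small batch size, as suggested in the post
    ]
    # --mlock is deliberately omitted so the OS is free to page weights
    # in and out (assumed equivalent of unchecking "keep model in memory").
    if kv_cache_type is not None:
        # Optional KV cache quantization, e.g. "q8_0".
        args += ["--cache-type-k", kv_cache_type, "--cache-type-v", kv_cache_type]
    return args

# "gemma4-26b-a4b-IQ4_NL.gguf" is a placeholder filename.
cmd = ["llama-cli"] + build_llama_args("gemma4-26b-a4b-IQ4_NL.gguf", kv_cache_type="q8_0")
print(" ".join(cmd))
```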

Thinking fix for LM Studio:

Also, for fellow LM Studio users: none of the currently published quants have thinking enabled by default, even though the model supports it. To enable it, go into the model settings and add the following line at the very top of the Jinja prompt template (under the inference tab).

{% set enable_thinking=true %}

Also change the reasoning parsing strings:

Start string: <|channel>thought

End string: <channel|>

(Credit for this goes to @Guilty_Rooster_6708. I didn't come up with this fix; I've linked to the post I got it from.)
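To show what the start and end strings are for, here is a minimal sketch of how a client might separate the reasoning from the final answer using them. This is only an illustration of the idea, not LM Studio's actual parser; the function name and the sample output are made up.

```python
from typing import Tuple

START = "<|channel>thought"
END = "<channel|>"

def split_reasoning(text: str, start: str = START, end: str = END) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' if the markers are missing."""
    s = text.find(start)
    if s == -1:
        return "", text.strip()
    e = text.find(end, s + len(start))
    if e == -1:
        return "", text.strip()
    reasoning = text[s + len(start):e].strip()
    # Everything outside the marked span is treated as the visible answer.
    answer = (text[:s] + text[e + len(end):]).strip()
    return reasoning, answer

# Made-up model output, for illustration only.
sample = "<|channel>thought\nThe user wants a sum: 2 + 2 = 4.<channel|>The answer is 4."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

If the strings in the settings do not match what the model actually emits, a parser like this finds no markers and the raw thinking text ends up in the visible answer, which is why both strings need to be changed.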

submitted by /u/FenderMoon