Gemma 4 31B — 4bit is all you need

Reddit r/LocalLLaMA / 4/14/2026


Key Points

  • The post reports a subjective benchmark of Gemma 4 31B quantized to 4-bit on an M5 Max 128GB MacBook Pro, comparing it against 8-bit and full precision (bf16) across multiple categories.
  • In the author’s tests, the Gemma 4 31B 4-bit variant scored higher than the 8-bit variant (91.3% vs 88.4%), though the author notes the exact cause may be template/prompt/quantization effects.
  • A key tradeoff observed is performance speed: the 31B 4-bit model runs at about 21 tokens/second, but delivers better results for the author than 31B 8-bit.
  • For the smaller Gemma 4 26B-A4B model, the author encountered failure cases where some questions entered a “regression loop,” with responses truncated at max tokens (16,384), preventing the model from recovering.
  • The overall takeaway is that 4-bit may be sufficient for strong quality, but the author suggests more rigorous testing is needed to identify where 4-bit begins to lose relative to full precision.

Gemma quant comparison on an M5 Max MacBook Pro 128GB (subjective of course, but across a variety of categories):

[image: gemma 4 leaderboard]

the surprising bit: Gemma 4 31B 4bit scored higher than 8bit — 91.3% vs 88.4%. not sure why: could be the template, could be quantization, could be my prompts. but it was consistent across runs
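For reference, a scoring harness like the one implied by the per-category leaderboard can be sketched in a few lines. This is a hypothetical structure (the post's actual tool appears to be "cupel", whose internals aren't shown); it just averages pass/fail results overall and per category:

```python
from statistics import mean

def score(results):
    """results: list of (category, passed) pairs.

    Returns (overall accuracy, per-category accuracy) as fractions.
    """
    by_cat = {}
    for cat, passed in results:
        by_cat.setdefault(cat, []).append(1.0 if passed else 0.0)
    per_cat = {cat: mean(vals) for cat, vals in by_cat.items()}
    overall = mean(p for vals in by_cat.values() for p in vals)
    return overall, per_cat

# toy example: 2 code questions (1 pass), 1 math question (pass)
overall, per_cat = score([("code", True), ("code", False), ("math", True)])
print(round(overall, 3), per_cat)
```

A harness like this run identically over the 4-bit, 8-bit, and bf16 variants is what makes a "91.3% vs 88.4%" comparison meaningful, even if the grading itself is subjective.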

[image: accuracy vs. tokens per second]

[image: category accuracy]

Gemma 4 26B-A4B would have scored higher, but on two questions it went into a regression loop and never came back; this happened on all the quants as well as full precision (bf16):

[image: 26B-A4B failing some tests due to regression loops]

I configured a 16,384 max-response-token limit, and it hit that max while looping:

$ grep WARN ~/.cupel/cupel.log
2026-04-13 19:00:25 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-4bit elapsed=215.0s tokens=16384
2026-04-13 19:04:52 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-4bit elapsed=214.5s tokens=16384
2026-04-13 19:21:42 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-8bit elapsed=260.1s tokens=16384
2026-04-13 19:26:02 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-8bit elapsed=260.5s tokens=16384
2026-04-13 19:45:52 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-bf16 elapsed=349.2s tokens=16384
2026-04-13 19:51:40 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-bf16 elapsed=348.0s tokens=16384
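Counting these truncation warnings per model is easy to automate. A minimal sketch, assuming the log format shown above (the path `~/.cupel/cupel.log` and the exact WARNING wording are taken from the post; adjust the regex for your own setup):

```python
import re
from collections import Counter

# Matches lines like:
#   2026-04-13 19:00:25 WARNING llm response truncated (hit max_tokens=16384)
#   model=gemma-4-26b-a4b-it-4bit elapsed=215.0s tokens=16384
LINE_RE = re.compile(
    r"WARNING llm response truncated \(hit max_tokens=(\d+)\) "
    r"model=(\S+) elapsed=([\d.]+)s"
)

def truncations_per_model(lines):
    """Count how many responses each model ran into the max_tokens cap."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            counts[m.group(2)] += 1
    return counts

# two sample lines from the post's log
log = """\
2026-04-13 19:00:25 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-4bit elapsed=215.0s tokens=16384
2026-04-13 19:21:42 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-8bit elapsed=260.1s tokens=16384
"""
counts = truncations_per_model(log.splitlines())
print(counts)
```

Spotting that every quant level (4-bit, 8-bit, bf16) trips the cap on the same questions is what points at a model behavior rather than a quantization artifact.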

"Gemma 4 31B 4 bit" is really good. it is a little on the slow side (21 tokens / second). but, as I mentioned before, it performs much better (for me) than "Gemma 4 31B 8 bit". I might however need better tests to see where 4bit starts losing to the full precision "Gemma 4 31B bf16", because as it stands right now they are peers.

I tested all of them yesterday, before these template updates were made by Hugging Face, and they performed slightly worse. the results above are retested with the template updates included, so the updates did work.

I think it would make sense to hold on to "Gemma 4 31B 4 bit" for overnight complex tasks that do not require quick responses: 21 tokens / second might be enough to churn through a few such tasks. for "day time" it might be a little slow on a MacBook, and "Qwen 122B A10B 4 bit" is still the local king. maybe once the M5 Ultra comes out (plus a few months to get one :) that may change.
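The overnight-use argument is easy to sanity-check with back-of-envelope arithmetic, using the 21 tokens/second rate and the 16,384-token cap from the post (the 8-hour window is an assumption for illustration):

```python
# rough throughput estimate for overnight batch use
tok_per_s = 21                     # rate reported in the post
hours = 8                          # assumed overnight window
max_tokens = 16_384                # per-response cap from the post

tokens_overnight = tok_per_s * 3600 * hours
# worst case: every response runs to the full cap
responses = tokens_overnight // max_tokens
print(tokens_overnight, responses)
```

So even in the worst case of maximum-length responses, an overnight run yields a few dozen full-length completions, which supports the "slow but fine for batch" conclusion.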

context: this was prompted by feedback in the reddit discussion, where I put together a list of items to work through in response to that feedback

submitted by /u/tolitius