> Gemma quant comparison on M5 Max MacBook Pro 128GB (subjective of course, but on a variety of categories); the surprising bit: accuracy vs. tokens per second. 24B-A4B failed some tests due to regression loops. I tested all of them yesterday, before these template updates were made by Hugging Face, and they did perform slightly worse. The above is retested with the template updates included, so the updates did work. I think it would make sense to hold on to […]. Context: this was prompted by the feedback in the reddit discussion, where I created a list to work on to address the feedback.
Gemma 4 31B — 4bit is all you need
Reddit r/LocalLLaMA / 4/14/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The post reports a subjective benchmark of Gemma 4 31B quantized to 4-bit on an M5 Max 128GB MacBook Pro, comparing it against 8-bit and full-precision (bf16) variants across multiple categories.
- In the author’s tests, the Gemma 4 31B 4-bit variant scored higher than the 8-bit variant (91.3% vs. 88.4%), though the author notes the difference may stem from template, prompt, or quantization effects rather than 4-bit being inherently better.
- The main cost is speed: even at 4-bit, the 31B model runs at only about 21 tokens/second on this hardware, yet it still delivered better results for the author than the 31B 8-bit variant.
- For the smaller Gemma 4 26B-A4B model, the author hit failure cases where some answers entered a “regression loop,” repeating until truncated at the 16,384-token maximum, so the model never recovered.
- The overall takeaway is that 4-bit may be sufficient for strong quality, but the author suggests more rigorous testing is needed to identify where 4-bit begins to lose relative to full precision.
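The memory side of the quant choice is simple arithmetic. A rough sketch of the weight footprint for a 31B-parameter model at each precision (this ignores KV cache, activations, and quantization overhead such as scale factors, so real usage is higher; the function name is made up for illustration):

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GiB for a dense model."""
    return n_params * bits_per_weight / 8 / 1024**3

N = 31e9  # 31B parameters
for name, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name:>5}: ~{weight_gib(N, bits):.1f} GiB")
```

All three fit in 128GB of unified memory, which is why the comparison against bf16 is even possible on this machine; on smaller configurations, 4-bit is often the only option for a model this size.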
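The “regression loop” failure mode described above (output repeating until it hits the 16,384-token cap) can be flagged with a simple heuristic. A minimal sketch, assuming generated token IDs are available as a list; the function name and window sizes are hypothetical, not from the post:

```python
def looped_tail(tokens: list[int], max_tokens: int = 16_384,
                window: int = 32, repeats: int = 4) -> bool:
    """Heuristic loop detector: generation hit the token cap AND the
    tail consists of the same short token window repeated back-to-back."""
    if len(tokens) < max_tokens:
        return False  # model stopped on its own; no loop suspected
    tail = tokens[-window * repeats:]
    pattern = tail[:window]
    return all(tail[i * window:(i + 1) * window] == pattern
               for i in range(repeats))
```

A benchmark harness could use a check like this to score looping responses as failures automatically instead of inspecting truncated transcripts by hand.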


