Bonsai models are pure hype: Bonsai-8B is MUCH dumber than Gemma-4-E2B

Reddit r/LocalLLaMA / 4/17/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author compares Bonsai-8B against Gemma-4, running Gemma in stock llama.cpp and Bonsai in the PrismML fork of llama.cpp, each at a different quantization level.
  • At a comparable hardware footprint, they argue Bonsai-8B performs worse than Gemma-4, noting it is only modestly smaller in memory despite having far more parameters.
  • They acknowledge a smaller Gemma quant was possible, but follow the conventional wisdom of not quantizing small models below Q4_K_M.
  • In an update, they test a ternary version of Bonsai-8B and report that its outputs are even more wrong than the 1-bit variant's, while the file is also larger on disk than Gemma's.

I'm using the https://github.com/PrismML-Eng/llama.cpp fork for Bonsai and regular llama.cpp for Gemma.

Without embedding parameters:
Gemma 4 has 2.3B params at 4.8 bpw (Q4_K_M) = 1104 MB
Bonsai-8B has 6.95B params at 1.125 bpw (Q1_0) = 782 MB (only 29% smaller)
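
The percentage is easy to sanity-check from the on-disk sizes alone; here's a minimal sketch using the figures quoted above (note that params × bpw only approximates GGUF file sizes, which also carry metadata and per-block scales):

```python
# Relative-size check from the on-disk sizes quoted above (in MB).
gemma_q4km_mb = 1104  # Gemma 4 @ 4.8 bpw (Q4_K_M)
bonsai_q1_mb = 782    # Bonsai-8B @ 1.125 bpw (Q1_0)

savings = 1 - bonsai_q1_mb / gemma_q4km_mb
print(f"Bonsai-8B Q1_0 is {savings:.0%} smaller")  # -> 29% smaller
```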

I could've gone with a smaller quant of Gemma 4; conventional wisdom just says not to push small models below Q4_K_M.

I might try their ternary model later, but I don't have much hope...

[UPDATE]

Tried the 1.58-bit/ternary model (https://huggingface.co/prism-ml/Ternary-Bonsai-8B-mlx-2bit); its answers were somehow even more wrong than the 1-bit one's. 6.95B parameters at 2.125 bpw is 1477 MB, 33% LARGER than Gemma!
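
On the "1.58 bit" naming: ternary weights {-1, 0, +1} need at least log2(3) ≈ 1.585 bits each in theory, but practical formats pack them into whole bits plus per-block scales, which is presumably where the 2.125 bpw comes from (that packing detail is an assumption about the fork's format, not anything PrismML documents). A quick sketch:

```python
import math

# Information-theoretic floor for ternary {-1, 0, +1} weights.
print(f"entropy floor: {math.log2(3):.3f} bpw")  # -> 1.585 bpw

# Size comparison from the on-disk figures above (in MB).
bonsai_ternary_mb = 1477  # Ternary-Bonsai-8B @ 2.125 bpw
gemma_q4km_mb = 1104      # Gemma 4 @ Q4_K_M
print(f"{bonsai_ternary_mb / gemma_q4km_mb - 1:.1%} larger than Gemma")  # -> 33.8% larger
```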

Tested in the latest version of oMLX: https://i.imgur.com/NsNNwzj.png

submitted by /u/WeGoToMars7