Bonsai models are pure hype: Bonsai-8B is MUCH dumber than Gemma-4-E2B

Reddit r/LocalLLaMA / 4/17/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author compares Bonsai-8B against Gemma-4, running Gemma in stock llama.cpp and Bonsai in the PrismML fork of llama.cpp, each at a different quantization level.
  • At a comparable hardware footprint, they argue Bonsai-8B performs worse than Gemma-4, noting it is only modestly smaller in memory despite having far more parameters.
  • They acknowledge a smaller Gemma quant was possible, but follow the conventional wisdom of not quantizing small models below Q4_K_M.
  • In an update, they test a ternary version of Bonsai-8B and report that its outputs are even more wrong than the 1-bit variant's, while the file is also larger on disk than Gemma's.

I'm using the https://github.com/PrismML-Eng/llama.cpp fork for Bonsai and regular llama.cpp for Gemma.

Without embedding parameters:
Gemma 4 has 2.3B params at 4.8 bpw (Q4_K_M) = 1104 MB
Bonsai-8B has 6.95B params at 1.125 bpw (Q1_0) = 782 MB (only 29% smaller)
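
The percentage is easy to sanity-check from the on-disk sizes alone; here's a minimal sketch using the figures quoted above (note that params × bpw only approximates GGUF file sizes, which also carry metadata and per-block scales):

```python
# Relative-size check from the on-disk sizes quoted above (in MB).
gemma_q4km_mb = 1104  # Gemma 4 @ 4.8 bpw (Q4_K_M)
bonsai_q1_mb = 782    # Bonsai-8B @ 1.125 bpw (Q1_0)

savings = 1 - bonsai_q1_mb / gemma_q4km_mb
print(f"Bonsai-8B Q1_0 is {savings:.0%} smaller")  # -> 29% smaller
```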

I could've gone with a smaller quant of Gemma 4; conventional wisdom just says not to push small models below Q4_K_M.

I might try their ternary model later, but I don't have much hope...

[UPDATE]

Tried the 1.58-bit/ternary model (https://huggingface.co/prism-ml/Ternary-Bonsai-8B-mlx-2bit); its answers were somehow even more wrong than the 1-bit one's. 6.95B parameters at 2.125 bpw is 1477 MB, 33% LARGER than Gemma!
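
On the "1.58 bit" naming: ternary weights {-1, 0, +1} need at least log2(3) ≈ 1.585 bits each in theory, but practical formats pack them into whole bits plus per-block scales, which is presumably where the 2.125 bpw comes from (that packing detail is an assumption about the fork's format, not anything PrismML documents). A quick sketch:

```python
import math

# Information-theoretic floor for ternary {-1, 0, +1} weights.
print(f"entropy floor: {math.log2(3):.3f} bpw")  # -> 1.585 bpw

# Size comparison from the on-disk figures above (in MB).
bonsai_ternary_mb = 1477  # Ternary-Bonsai-8B @ 2.125 bpw
gemma_q4km_mb = 1104      # Gemma 4 @ Q4_K_M
print(f"{bonsai_ternary_mb / gemma_q4km_mb - 1:.1%} larger than Gemma")  # -> 33.8% larger
```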

Tested in the latest version of oMLX: https://i.imgur.com/NsNNwzj.png

submitted by /u/WeGoToMars7