| I'm using the https://github.com/PrismML-Eng/llama.cpp fork for Bonsai and regular llama.cpp for Gemma. Without embedding parameters: I could've gone with a smaller quant of Gemma 4, but conventional wisdom says not to push small models beyond Q4_K_M. I might try their ternary model later, though I don't have much hope... [UPDATE] Tried the 1.58-bit/ternary model (https://huggingface.co/prism-ml/Ternary-Bonsai-8B-mlx-2bit); its answers were somehow even more wrong than the 1-bit one's. 6.95B parameters at 2.125 bpw is 1477 MB, 33% LARGER than Gemma! Tested in the latest version of oMLX: https://i.imgur.com/NsNNwzj.png |
Bonsai models are pure hype: Bonsai-8B is MUCH dumber than Gemma-4-E2B
Reddit r/LocalLLaMA / 4/17/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The author compares Bonsai-8B against Gemma-4, using different quantization/embedding setups in llama.cpp and a PrismML-llama.cpp fork for Bonsai.
- At a comparable hardware footprint, they argue Bonsai-8B performs worse than Gemma-4, noting it is only modestly smaller in memory despite being much larger in parameter count.
- The author cites the conventional wisdom of not pushing small models beyond roughly Q4_K_M, explaining why they did not use a smaller Gemma-4 quant to match Bonsai's footprint.
- In an update, they test a ternary (1.58-bit) version of Bonsai-8B and report that its outputs are even more incorrect than those of the 1-bit variant, while its file is also about a third larger than Gemma's.
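
The size claims in the post reduce to simple bits-per-weight arithmetic. The sketch below (plain Python, not from the post) shows how such estimates are usually derived; the parameter count and bpw figure are the poster's, the Gemma size is merely implied by their "33% larger" claim, and the assumption that the bpw applies uniformly to all weights is mine. Real GGUF/MLX files deviate from this naive estimate because embedding and output tensors are often kept at higher precision and group-wise quantization stores extra scale metadata.

```python
def quantized_size_mb(n_params: float, bits_per_weight: float) -> float:
    """Naive on-disk size estimate: parameters * bits-per-weight, in MB."""
    return n_params * bits_per_weight / 8 / 1e6


# Figures quoted in the post (assumption: the 2.125 bpw covers all 6.95B weights).
bonsai_naive = quantized_size_mb(6.95e9, 2.125)  # ~1846 MB by this arithmetic
bonsai_reported = 1477                           # MB, as reported by the poster
gemma_implied = bonsai_reported / 1.33           # ~1110 MB, from the "33% larger" claim

print(f"naive Bonsai estimate: {bonsai_naive:.0f} MB, reported: {bonsai_reported} MB")
print(f"implied Gemma file size: {gemma_implied:.0f} MB")
```

The gap between the naive estimate and the reported file size presumably reflects which tensors the bpw figure actually covers (the post mentions excluding embedding parameters), but that reading is an assumption, not something stated explicitly in the source.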



