The Bonsai 1-bit models are very good

Reddit r/LocalLLaMA / 4/2/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author tests PrismML’s Bonsai 8B 1-bit model locally on an M4 Max MacBook Pro and reports strong performance across practical tasks like chat, document summarization, tool calling, and web search.
  • They note a key limitation: despite being distributed as GGUF, Bonsai 1-bit models can’t be loaded directly into standard llama.cpp and require PrismML’s fork that supports 1-bit operations.
  • The discussion highlights ongoing infrastructure progress in related llama.cpp code (including KV rotation merged upstream) and the author’s own upstream fork to incorporate 1-bit changes.
  • The author contrasts Bonsai favorably against prior “BitNet” 1-bit models from Microsoft, claiming those were largely unusable, while Bonsai is described as genuinely workable.
  • They emphasize the main benefit for local deployment—substantially lower memory pressure than comparable quantized models—while suggesting more 1-bit model series may follow as training effort becomes more feasible.

Hey everyone,

Tim from AnythingLLM here. Yesterday I saw the PrismML Bonsai post, so I had to give it a real shot, because 14x smaller models (in size and memory) would actually be a huge game changer for local models, which is basically all I do.

I personally only ran the Bonsai 8B model for my tests, which are more practical than anything (chat, document summarization, tool calling, web search, etc.), so your mileage may vary. I was running this on an M4 Max 48GB MacBook Pro, and I wasn't even using the MLX model. I do want to see if I can get the 1.7B model running on my old Android S20.

The only downside right now is that you cannot just load this into llama.cpp directly, even though it is a GGUF; you instead need to use their fork of llama.cpp, which supports the 1-bit operations.
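For anyone wanting to try it, the flow is the usual llama.cpp CMake build, just pointed at the fork instead of upstream. The repo URL and model filename below are placeholders (I'm not going to pretend I remember the exact paths), so check PrismML's actual repo before copy-pasting:

```shell
# Placeholder URL -- substitute PrismML's actual llama.cpp fork.
git clone https://github.com/PrismML/llama.cpp.git prismml-llama
cd prismml-llama

# Standard llama.cpp CMake build (Metal is enabled by default on macOS).
cmake -B build
cmake --build build --config Release -j

# Run the 1-bit GGUF with the fork's llama-cli; flags are the same as upstream.
./build/bin/llama-cli -m bonsai-8b-1bit.gguf -p "Hello" -n 128
```

Same flags, same tooling as regular llama.cpp, it just has the extra kernels to actually execute the 1-bit ops.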

That fork is really far behind upstream llama.cpp, and ggerganov just merged in the KV rotation PR today, which is a single piece of TurboQuant but supposedly helps with KV accuracy under compression. So I made an up-to-date fork of upstream with the 1-bit changes ported over (no promises it works everywhere lol).

I can attest this model is not even on the same planet as the previously available MSFT BitNet models, which were basically unusable and purely for research purposes.

I didn't even try to get this running on CUDA, but I can confirm the memory pressure is indeed much lower compared to something of a similar size (Qwen3 VL 8B Instruct Q4_K_M). I know that is not apples to apples, but I'm just trying to give an idea.
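Quick back-of-envelope on why the memory drop is so dramatic. This is my own rough math, not PrismML's numbers: I'm assuming ternary weights at ~1.58 bits/weight, Q4_K_M at roughly ~4.85 bits/weight, and I'm ignoring embeddings, KV cache, and GGUF metadata overhead, so treat these as ballpark figures only:

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB at a given bits-per-weight."""
    return n_params * bits_per_weight / 8 / 1024**3

N = 8e9  # 8B parameters

fp16 = weight_gb(N, 16)        # unquantized half precision
q4_k_m = weight_gb(N, 4.85)    # Q4_K_M averages roughly ~4.85 bits/weight
ternary = weight_gb(N, 1.58)   # ternary "1-bit" weights, log2(3) bits each

print(f"fp16:    {fp16:5.2f} GiB")
print(f"Q4_K_M:  {q4_k_m:5.2f} GiB")
print(f"ternary: {ternary:5.2f} GiB")
```

That's roughly 15 GiB for fp16 vs ~4.5 GiB for Q4_K_M vs ~1.5 GiB for ternary weights, which is why an 8B model stops being scary for a laptop or even a phone.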

Understandably, news like this on April Fools' is not ideal, but it's actually not a joke: we finally have a decent 1-bit model series! I'm sure these are not easy to train up, so maybe we will see others do it soon.

TBH, you would think news like this would shake a memory or GPU stock like TurboQuant did earlier this week, but here we are with an actual real model that runs incredibly well on fewer resources out in the wild and... crickets.

Anyway, lmk if y'all have tried this out yet and what you think. I don't work with PrismML or even know anyone there; I just thought it was cool.

submitted by /u/tcarambat