I read the article yesterday:
https://prismml.com/news/bonsai-8b
And watched the only three videos that had surfaced about these bonsai models. It seemed legit, but I still half suspected an April Fools' joke.
So today I woke up wanting to try them. I downloaded their 8B model and their llama.cpp fork, tested it, and as far as I can see it's real:
On my humble 4060: 107 t/s generation and >1114 t/s prompt processing, with a model that's evidently tiny. For comparison, qwen 3.5 4B at Q4 had given me 56 t/s with the same prompts.
Most importantly, the RAM used is much, much lower, so I can run an 8B model in my humble 8GB of VRAM, or the smaller models with longer context.
Quality: my use case is summarizing text, and on first inspection it worked well. I didn't try coding or tool use, but for summarization it is golden.
The only bad part: while it worked well on my Windows PC with CUDA, when I tried it on a GPU-less mini PC (to gauge potential edge performance), it doesn't work. The llama.cpp fork compiles and loads the model, seems to start processing the prompt, and then hangs. I asked Claude to check their code, and it tells me they have no CPU implementation, so the fork might be dequantizing to FP32 and attempting regular inference (which would be dead slow on CPU).
I think these 1-bit models have potential not only to reduce bandwidth and memory requirements, but compute requirements too: matrix multiplication over 1-bit matrices should reduce to something like XOR-and-popcount operations, much faster than FP-anything. As I understand it, even if scaling back to FP16 is required after the bitwise part, a huge amount of compute is still saved, which should help CPU-only inference, and edge inference in general.
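To illustrate the idea (this is just a toy sketch of the math, not how the Bonsai models or their llama.cpp fork actually implement it; the function names are mine): if ±1 weights are packed into the bits of an integer, a dot product becomes a single XOR plus a popcount, because XOR marks the positions where the signs differ.

```python
import random

def onebit_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n vectors of +/-1 values.

    Each vector is packed into an int: bit=1 encodes +1, bit=0 encodes -1.
    Matching bits contribute +1, mismatching bits -1, so the result is
    (#matches) - (#mismatches) = n - 2 * popcount(a XOR b).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")  # XOR flags sign mismatches
    return n - 2 * mismatches

# Sanity check against the naive floating-point-style dot product
n = 16
a = [random.choice([-1, 1]) for _ in range(n)]
b = [random.choice([-1, 1]) for _ in range(n)]
pack = lambda v: sum(1 << i for i, x in enumerate(v) if x == 1)
assert onebit_dot(pack(a), pack(b), n) == sum(x * y for x, y in zip(a, b))
```

On real hardware this maps to wide XOR instructions plus a hardware popcount, so one instruction pair covers 64+ multiply-accumulates at once, which is where the CPU-side speedup would come from.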
There's hope for us VRAM-starved plebes after all!! (and hopefully this helps deflate RAMageddon, and the AI datacenter bubble in general)