I read the article yesterday:
https://prismml.com/news/bonsai-8b
And watched the only three videos that had surfaced about these bonsai models. It seemed legit, but I still half suspected an April Fools' joke.
So today I woke up wanting to try them. I downloaded their 8B model and their llama.cpp fork, tested it, and as far as I can see it's real:
On my humble 4060: 107 t/s generation and >1114 t/s prompt processing, with a model that's evidently tiny. For comparison, qwen 3.5 4B at Q4 had given me 56 t/s with the same prompts.
Most importantly, the RAM used is much, much lower, so I can run an 8B model in my humble 8GB of VRAM, or the smaller models with longer context.
Quality: my use case is summarizing text, and on first inspection it worked well. I didn't try coding or tool use, but for summarization it is golden.
The only bad part: while it worked well on my Windows PC with CUDA, when I tried it on a GPU-less mini PC (to gauge potential edge performance), it doesn't work. The llama.cpp fork compiles and loads the model, seems to start processing the prompt, and then hangs. I asked Claude to check their code, and it tells me they have no CPU implementation, so the fork might be dequantizing to FP32 and attempting regular inference (which would be dead slow on CPU).
I think these 1-bit models have potential not only to reduce bandwidth and memory requirements, but compute requirements too: matrix multiplication over 1-bit matrices should reduce to something like XOR-and-popcount operations, much faster than FP-anything. As I understand it, even if scaling back to FP16 is required after the bitwise part, a huge amount of compute is still saved, which should help CPU-only inference, and edge inference in general.
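To illustrate the idea (this is just a toy sketch of the math, not how the Bonsai models or their llama.cpp fork actually implement it; the function names are mine): if ±1 weights are packed into the bits of an integer, a dot product becomes a single XOR plus a popcount, because XOR marks the positions where the signs differ.

```python
import random

def onebit_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n vectors of +/-1 values.

    Each vector is packed into an int: bit=1 encodes +1, bit=0 encodes -1.
    Matching bits contribute +1, mismatching bits -1, so the result is
    (#matches) - (#mismatches) = n - 2 * popcount(a XOR b).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")  # XOR flags sign mismatches
    return n - 2 * mismatches

# Sanity check against the naive floating-point-style dot product
n = 16
a = [random.choice([-1, 1]) for _ in range(n)]
b = [random.choice([-1, 1]) for _ in range(n)]
pack = lambda v: sum(1 << i for i, x in enumerate(v) if x == 1)
assert onebit_dot(pack(a), pack(b), n) == sum(x * y for x, y in zip(a, b))
```

On real hardware this maps to wide XOR instructions plus a hardware popcount, so one instruction pair covers 64+ multiply-accumulates at once, which is where the CPU-side speedup would come from.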
There's hope for us VRAM-starved plebes after all!! (and hopefully this helps deflate RAMageddon, and the AI datacenter bubble in general)