Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU)

Reddit r/LocalLLaMA / 4/4/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The author reports successfully running the 1-bit Bonsai 8B LLM on an older 2018 laptop using an NVIDIA MX150 mobile GPU with only 2GB VRAM via a llama.cpp (PrismML) fork compiled with CUDA support.
  • They tune llama-server CLI settings to fit within VRAM limits—disabling the -fit option, using q8_0 quantized KV cache, and setting -np 1—then benchmark with warmup and ~1000-token prompt runs.
  • Maximum stable context size is tightly constrained by GPU memory and stability, reaching 5632 tokens with ubatch=512, while larger ubatch values reduce usable context (e.g., only 1024 tokens at ubatch=1024).
  • Throughput is limited and thermally constrained: generation token rates start around 7–9 tps but drop 30–40% once the GPU heats up and begins thermal throttling, with PP at roughly 48–52 tps depending on ubatch.
  • Power draw during benchmarks is about 45–50W system-wide, translating to roughly 6 Joules per generated token at ~8 tps, indicating poor energy efficiency for this workload.

I have an older laptop from ~2018, an Asus Zenbook UX430U. It was quite powerful in its time, with an i7-8550U CPU @ 1.80GHz (4 physical cores and an Intel iGPU), 16GB RAM and an additional NVIDIA MX150 GPU with 2GB VRAM. I think the GPU was intended for CAD applications, Photoshop filters or such - it is definitely not a gaming laptop. I'm using Linux Mint with the Cinnamon desktop using the iGPU only, leaving the MX150 free for other uses.

I never thought I would run LLMs on this machine, though I've occasionally used the MX150 GPU to train small PyTorch or TensorFlow models; it is maybe 3 times faster than the CPU alone. However, when the 1-bit Bonsai 8B model was released, I couldn't resist trying to run it on this GPU.

So I took the llama.cpp fork from PrismML, compiled it with CUDA support and played around. I soon decided to turn off the -fit option because with such tight VRAM it's not very helpful; instead I optimized the CLI parameters manually. I chose a q8_0 quantized KV cache and -np 1 to save a bit of VRAM. I couldn't get llama-bench to cooperate, so I just used llama-server. My test procedure was to start llama-server, send a small warmup query, and then send a benchmark query with an approximately 1000-token prompt. Accurate benchmarking was difficult, because the GPU quickly heats up to around 80C and starts thermal throttling, which cuts performance by 30-40%. I let the machine cool a little between runs, tried a few times and reported the highest numbers.
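The PP and TG numbers below are just tokens divided by elapsed time for each phase. Here is a hypothetical helper showing that arithmetic (the parameter names are illustrative, not llama-server's actual JSON timing fields):

```python
# Hypothetical helper mirroring the test procedure: derive PP/TG rates
# from per-phase token counts and elapsed times.
def rates(prompt_tokens, prompt_ms, gen_tokens, gen_ms):
    pp_tps = prompt_tokens / (prompt_ms / 1000.0)  # prompt processing speed
    tg_tps = gen_tokens / (gen_ms / 1000.0)        # token generation speed
    return round(pp_tps, 1), round(tg_tps, 1)

# e.g. a ~1000-token prompt processed in ~19.2 s,
# then 200 tokens generated in 25 s
print(rates(1000, 19200, 200, 25000))  # → (52.1, 8.0)
```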

With the default ubatch size 512, the maximum context I could fit without crashing was 5632. I get 52 tps on PP. TG starts off with 9 tps but quickly falls to around 7-8 or even less if the GPU heats up too much.

Here is my llama-server command:

    llama-server -m Bonsai-8B.gguf -ctk q8_0 -ctv q8_0 -np 1 -fit off -ub 512 -c 5632

I also tried other ubatch sizes and optimized the maximum context I could fit. Here is a summary:

    ubatch  ctx   PP (tps)  TG (tps)  comments
    1024    1024  54        9         Only generated a few tokens before running out of context.
    512     5632  52        8
    256     7680  48        8
    128     8704  41        8

It looks like the PP speed is not much affected by the ubatch size, at least for values of 256 and above. The sweet spot for ubatch, if you can call it that, is around 256-512. TG speed is always around 8 tps before thermal throttling kicks in. With a ubatch size of 1024, the maximum context length is 1024, which is pretty useless.
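The trade-off makes sense on a VRAM budget: the compute buffers grow with ubatch, so less memory is left for the KV cache, which grows with context. Bonsai 8B's exact dimensions aren't known to me, so here is a rough KV-cache estimate under assumed Llama-8B-like dimensions (32 layers, 8 KV heads, head dim 128):

```python
# Rough KV-cache size vs. context length, assuming Llama-8B-like
# dimensions (these are assumptions, not Bonsai 8B's actual specs).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
Q8_0_BYTES = 34 / 32  # q8_0 packs 32 values into 34 bytes

def kv_cache_mib(ctx):
    # K and V caches: one entry per layer, per KV head, per context slot
    elems = 2 * N_LAYERS * ctx * N_KV_HEADS * HEAD_DIM
    return elems * Q8_0_BYTES / 2**20

for ctx in (1024, 5632, 8704):
    print(ctx, round(kv_cache_mib(ctx)), "MiB")
```

Under these assumptions the q8_0 KV cache costs roughly 68 KiB per context token (68 MiB at ctx=1024, ~374 MiB at ctx=5632), so a few hundred MiB of extra compute buffers at ubatch=1024 can plausibly eat thousands of tokens of context on a 2GB card.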

With the laptop battery fully charged, I also measured power draw from the outlet while running the benchmarks: it was around 45-50W. This includes power usage by the GPU, CPU, display and everything else on the machine. So with a TG speed of 8 tps, the energy usage was around 6 Joules per token. That's not particularly efficient.

Does this make any sense? I don't think so. It's kind of cool that you can run an 8B parameter LLM on just 2GB VRAM, but this MX150 GPU, at least, is not suitable for LLM inference. I can't think of any good reason to use it beyond "it's possible so let's do it". At these speeds you are probably better off just using the CPU alone; as a bonus, you can probably fit a much longer context into system RAM.

This was my first post on r/LocalLLaMA. I hope you enjoyed it. No AIs were hurt, or even consulted, while writing this post.

submitted by /u/OsmanthusBloom