AI Navigate

Follow-up: Qwen3 30B a3b at 7-8 t/s on a Raspberry Pi 5 8GB (source included)

Reddit r/LocalLLaMA / 3/20/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The update shows that Qwen3-30B-A3B-Instruct-2507-GGUF can run on a Raspberry Pi 5 with 8GB RAM and an SSD at 7-8 t/s with a 16,384-token context, using the Q3_K_S quant (2.66 bpw).
  • The setup is packaged as Potato OS, a flashable headless Debian image that boots to a system which automatically downloads Qwen3.5 2B with vision encoder (~1.8GB) and exposes an OpenAI-compatible API over the local network, plus a basic web chat for testing.
  • Users can swap in other models by pasting a HuggingFace URL or uploading one over LAN via the web interface, enabling flexible offline inference.
  • The project is still in early days with no OTA updates yet, and credits PaMRxR for the ByteShape quant lead; the full source is on GitHub at slomin/potato-os with flashing instructions.

Disclaimer: everything here runs locally on the Pi 5 (no API calls, no eGPU, etc.); source/image available below.

This is the follow-up to my post about a week ago. Since then I've added an SSD, the official active cooler, switched to a custom ik_llama.cpp build, and got prompt caching working. The results are... significantly better.

The demo is running byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF, specifically the Q3_K_S 2.66 bpw quant. On a Pi 5 8GB with an SSD, I'm getting 7-8 t/s at a 16,384-token context length. Huge thanks to u/PaMRxR for pointing me towards the ByteShape quants in the first place. On a 4-bit quant of the same model family you can expect 4-5 t/s.
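For context on why the SSD matters, here's a back-of-envelope check (my own arithmetic, not from the post): at 2.66 bits per parameter, the 30B weights come to roughly 10 GB, more than the Pi's 8 GB of RAM, so a llama.cpp-style runner has to mmap the file and page weights in from disk. With only ~3B active parameters per token (the "A3B" in the name), the hot working set stays much smaller than the full file, which is what makes this usable at all.

```shell
# Rough size of a 30B-parameter model at 2.66 bits per weight
# (assumption: ignores embedding/metadata overhead, which shifts
# the number by a few percent either way)
awk 'BEGIN { printf "%.1f GB\n", 30e9 * 2.66 / 8 / 1e9 }'
# Prints: 10.0 GB, i.e. larger than the Pi 5's 8 GB of RAM
```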

The whole thing is packaged as a flashable headless Debian image called Potato OS. You flash it, plug in your Pi, and walk away. After boot there's a 5-minute timeout, then it automatically downloads Qwen3.5 2B with vision encoder (~1.8GB), so if you come back in 10 minutes and go to http://potato.local it's ready to go. If you know what you're doing, you can get there as soon as it boots and pick a different model, paste a HuggingFace URL, or upload one over LAN through the web interface. It exposes an OpenAI-compatible API on your local network, and there's a basic web chat for testing, but the API is the real point: you can hit it from anything:

curl -sN http://potato.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is the capital of Serbia?"}],"max_tokens":16,"stream":true}' \
  | grep -o '"content":"[^"]*"' | cut -d'"' -f4 | tr -d '\n'; echo
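As an aside, the grep/cut/tr tail of that command is a quick-and-dirty parser for the streamed response: it pulls every "content" fragment out of the server-sent event lines and concatenates them. Here it is run against two canned sample lines (illustrative data I made up, not actual server output):

```shell
# Two hand-written SSE chunks standing in for the server's streamed output.
# grep -o isolates each "content":"..." pair, cut takes the value between
# the quotes (field 4 when splitting on '"'), and tr glues the fragments
# onto one line; the trailing echo restores the final newline.
printf '%s\n' \
  'data: {"choices":[{"delta":{"content":"Bel"}}]}' \
  'data: {"choices":[{"delta":{"content":"grade"}}]}' \
  | grep -o '"content":"[^"]*"' | cut -d'"' -f4 | tr -d '\n'; echo
# Prints: Belgrade
```

Note this falls over if a fragment contains an escaped quote; piping through jq instead is the robust option when it's available.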

Full source: github.com/slomin/potato-os. Flashing instructions here. Still early days: no OTA updates yet (reflash to upgrade), and there will be bugs. I've tested it with the Qwen3, Qwen3-VL, and Qwen3.5 model families so far. But if you've got a Pi 5 gathering dust, give it a go and let me know what breaks.

submitted by /u/jslominski