| Disclaimer: everything here runs locally on Pi5, no API calls/no egpu etc, source/image available below. This is the follow-up to my post about a week ago. Since then I've added an SSD, the official active cooler, switched to a custom ik_llama.cpp build, and got prompt caching working. The results are... significantly better. The demo is running byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF, specifically the Q3_K_S 2.66bpw quant. On a Pi 5 8GB with SSD, I'm getting 7-8 t/s at 16,384 context length. Huge thanks to u/PaMRxR for pointing me towards the ByteShape quants in the first place. On a 4 bit quant of the same model family you can expect 4-5t/s. The whole thing is packaged as a flashable headless Debian image called Potato OS. You flash it, plug in your Pi, and walk away. After boot there's a 5 minute timeout that automatically downloads Qwen3.5 2B with vision encoder (~1.8GB), so if you come back in 10 minutes and go to Full source: github.com/slomin/potato-os. Flashing instructions here. Still early days, no OTA updates yet (reflash to upgrade), and there will be bugs. I've tested it on Qwen3, 3VL and 3.5 family of models so far. But if you've got a Pi 5 gathering dust, give it a go and let me know what breaks. [link] [comments] |
Follow-up: Qwen3 30B a3b at 7-8 t/s on a Raspberry Pi 5 8GB (source included)
Reddit r/LocalLLaMA / 3/20/2026
📰 NewsDeveloper Stack & InfrastructureTools & Practical Usage
Key Points
- The update shows that Qwen3-30B-A3B-Instruct-2507-GGUF can run on a Raspberry Pi 5 with 8GB RAM and SSD at 7-8 t/s with a 16,384-context length using 4-bit quant (2.66 bpw).
- The setup is packaged as Potato OS, a flashable headless Debian image that boots to a system which automatically downloads Qwen3.5 2B with vision encoder (~1.8GB) and exposes an OpenAI-compatible API over the local network, plus a basic web chat for testing.
- Users can swap in other models by pasting a HuggingFace URL or uploading one over LAN via the web interface, enabling flexible offline inference.
- The project is still in early days with no OTA updates yet, and credits PaMRxR for the ByteShape quant lead; the full source is on GitHub at slomin/potato-os with flashing instructions.
Related Articles

ベテランの若手育成負担を減らせ、PLC制御の「ラダー図」をAIで生成
日経XTECH

Your AI generated code is "almost right", and that is actually WORSE than it being "wrong".
Dev.to

Lessons from Academic Plagiarism Tools for SaaS Product Development
Dev.to

Windsurf’s New Pricing Explained: Simpler AI Coding or Hidden Trade-Offs?
Dev.to

Building Production RAG Systems with PostgreSQL: Complete Implementation Guide
Dev.to