Hey all, this is an update! A few days ago I posted about the performance of a Raspberry Pi 5 when using an SSD to let larger models run. Rightfully, a few people brought to my attention that PCIe is faster than the USB3 connection I was using, so I bought the official M.2 HAT. Spoiler, as expected: read speed roughly doubled, leading to a 1.5x to 2x improvement in tokens/sec for inference and text generation on models running from swap. I'll repeat my setup shortly:
One detail first: by default the Pi runs the PCIe interface at the Gen2 standard, so I only got ~418MB/sec read speed from the SSD when using the HAT. I appended dtparam=pciex1_gen=3 to the boot config, and read speed increased from 360.18MB/sec (USB) by a factor of 2.2x, which seems to be the maximum others have achieved with the HAT as well. My SSD is partitioned half as swap space and half as a partition where I store my models (though those could live anywhere else). Models that fit in RAM don't need the swap, of course. I benchmarked all models with the same command, testing prompt processing (pp512) and text generation (tg128) at zero and (for almost all) at 32k context. Here are the filtered results in alphabetical order (names adjusted where needed; GLM4.7-Flash, for example, is reported as the underlying deepseek2 architecture):
build: 8c60b8a2b (8544) & b7ad48ebd (8661, because of gemma4). I'll put the full llama-bench output into the comments for completeness' sake. The list includes Bonsai8B, for which I compiled the llama.cpp fork and tested with that. Maybe I did something wrong, maybe the calculations aren't really optimized for ARM CPUs, I don't know. I'm not interested in looking into that model further, but I was asked to include it. A few observations and remarks:
Take from my tests what you need. I'm happy to have this little potato and to experiment with it. Other models can be tested if there's demand. If you have any questions, just comment or write me. :) Edit 2026-04-05: Added 32k results for gpt-oss 120b
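The exact benchmark command isn't quoted above; an llama-bench invocation along these lines would produce the pp512/tg128 numbers at zero and 32k context (a sketch only — the model path is a placeholder and the author's exact flags may differ):

```shell
# Sketch: prompt processing of 512 tokens (pp512) and generation of
# 128 tokens (tg128), each measured at context depth 0 and 32768.
# The model path is a placeholder, not the author's actual file.
./llama-bench -m models/gemma4.gguf -p 512 -n 128 -d 0,32768
```

llama-bench accepts comma-separated value lists, so one run covers both context depths per model.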
benchmarks of gemma4 and multiple others on Raspberry Pi 5
Reddit r/LocalLLaMA / 4/6/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The author benchmarks Gemma4 and several other LLMs on a Raspberry Pi 5, comparing inference/text-generation performance when storage is attached via USB versus the official M.2 HAT (PCIe).
- Switching from USB3 to PCIe increases SSD read throughput by about 2.2x (roughly doubling read speed), which translates into an estimated 1.5x–2x improvement in tokens/sec when models are served with swap.
- The test setup uses a stock Raspberry Pi OS Lite (Trixie), an official active cooler, and a 1TB SSD with half swap and half model storage, while running different prompt-processing (pp512) and text-generation (tg128) workloads.
- The PCIe performance gain is achieved by adjusting the Pi’s PCIe generation setting (dtparam=pciex1_gen=3), raising SSD read rates close to the maximum reported by others using the same HAT.
- Benchmarks are run with llama.cpp’s llama-bench across model sizes and context lengths (including near-32k contexts) to show practical expectations with minimal hardware tinkering.
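The storage and PCIe steps summarized above can be sketched as follows (assumptions: the SSD shows up as /dev/nvme0n1 with the swap half on partition 1, and the boot config lives at /boot/firmware/config.txt, its usual location on current Raspberry Pi OS — neither device name nor path is stated in the post):

```shell
# Raise the PCIe link from the default Gen2 to Gen3 (takes effect after reboot)
echo 'dtparam=pciex1_gen=3' | sudo tee -a /boot/firmware/config.txt
sudo reboot

# After reboot: format and enable the swap half of the SSD
sudo mkswap /dev/nvme0n1p1
sudo swapon /dev/nvme0n1p1

# Sanity-check sequential read throughput; at Gen3 it should land
# near the ~2.2x-over-USB figure reported in the post
sudo hdparm -t /dev/nvme0n1
```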