benchmarks of gemma4 and multiple others on Raspberry Pi5

Reddit r/LocalLLaMA / 4/6/2026

Key Points

The author benchmarks Gemma4 and several other LLMs on a Raspberry Pi 5, comparing inference/text-generation performance when storage is attached via USB versus the official M.2 HAT (PCIe).
Switching from USB3 to PCIe increases SSD read throughput by about 2.2x (roughly doubling read speed), which translates into an estimated 1.5x–2x improvement in tokens/sec when models are served with swap.
The test setup uses a stock Raspberry Pi OS Lite (Trixie), an official active cooler, and a 1TB SSD with half swap and half model storage, while running different prompt-processing (pp512) and text-generation (tg128) workloads.
The PCIe performance gain is achieved by adjusting the Pi’s PCIe generation setting (dtparam=pciex1_gen=3), raising SSD read rates close to the maximum reported by others using the same HAT.
Benchmarks are run with llama.cpp’s llama-bench across model sizes and context lengths (including near-32k contexts) to show practical expectations with minimal hardware tinkering.

Hey all,

This is an update! A few days ago I posted about the performance of a Raspberry Pi5 when using an SSD to let larger models run. A few people rightly pointed out that PCIe is faster than the USB3 connection I was using, so I bought the official HAT.

Spoiler: as expected, read speed doubled, leading to a 1.5x to 2x improvement in tokens/sec for prompt processing and text generation on models in swap.

I'll briefly repeat my setup:

Raspberry Pi5 with 16GB RAM
Official Active Cooler
Official M.2 HAT+ Standard
1TB SSD connected via HAT
Running stock Raspberry Pi OS Lite (Trixie)

My focus is on the question: What performance can I expect when buying a few standard components and doing only a little bit of tinkering? I know I could buy larger fans/coolers from third-party sellers, overclock and overvolt, or buy more niche devices like an Orange Pi, but that's not what I wanted. So I went with a standard Pi and kept tinkering to a minimum, so that most people can do the same.

By default the Pi runs the PCIe interface at Gen2 speed (so I only got ~418MB/sec read speed from the SSD when using the HAT). To use Gen3, I appended dtparam=pciex1_gen=3 to the file "/boot/firmware/config.txt" and rebooted.
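For anyone who wants to script that config change, here is a minimal Python sketch. The helper name `ensure_config_line` and the demo file content are made up for illustration; only the path and the dtparam line come from the post. It appends the line only if it is missing, so running it twice does no harm. On a real Pi the file is owned by root and the change only takes effect after a reboot.

```python
import tempfile
from pathlib import Path

# Path and setting as given in the post.
CONFIG_PATH = Path("/boot/firmware/config.txt")
GEN3_LINE = "dtparam=pciex1_gen=3"


def ensure_config_line(path: Path, line: str) -> bool:
    """Append `line` to `path` unless an identical line already exists.

    Returns True if the file was changed, False otherwise.
    (Hypothetical helper, not part of any Raspberry Pi tooling.)
    """
    text = path.read_text() if path.exists() else ""
    if line in (existing.strip() for existing in text.splitlines()):
        return False
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + line + "\n")
    return True


# Demo on a throwaway file instead of the real boot config.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "config.txt"
    demo.write_text("arm_64bit=1\n")  # placeholder content for the demo
    first = ensure_config_line(demo, GEN3_LINE)   # appends the line
    second = ensure_config_line(demo, GEN3_LINE)  # already present, no change
    demo_text = demo.read_text()
```

To apply it for real, one would call `ensure_config_line(CONFIG_PATH, GEN3_LINE)` as root and then reboot.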

Read speed of the SSD increased from 360.18MB/sec (USB) by a factor of about 2.2 to 798.72MB/sec, which seems to be the maximum others have achieved with the HAT too.

$ sudo hdparm -t --direct /dev/nvme0n1p2
/dev/nvme0n1p2:
 Timing O_DIRECT disk reads: 2398 MB in 3.00 seconds = 798.72 MB/sec
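The speedup figures can be checked from the numbers quoted in the post. A small sketch, assuming the hdparm timing line always ends in "= <number> MB/sec" as it does here (the function name `parse_hdparm_mb_per_sec` is made up for this example):

```python
import re

# hdparm output as quoted in the post.
HDPARM_OUTPUT = """/dev/nvme0n1p2:
 Timing O_DIRECT disk reads: 2398 MB in 3.00 seconds = 798.72 MB/sec"""


def parse_hdparm_mb_per_sec(output: str) -> float:
    """Extract the MB/sec figure from an hdparm -t timing line."""
    match = re.search(r"=\s*([\d.]+)\s*MB/sec", output)
    if match is None:
        raise ValueError("no 'MB/sec' figure found in hdparm output")
    return float(match.group(1))


gen3 = parse_hdparm_mb_per_sec(HDPARM_OUTPUT)
usb3 = 360.18  # MB/sec over USB3, from the post
gen2 = 418.0   # approximate MB/sec at the default PCIe Gen2, from the post

print(f"Gen3 vs USB3: {gen3 / usb3:.2f}x")  # Gen3 vs USB3: 2.22x
print(f"Gen3 vs Gen2: {gen3 / gen2:.2f}x")  # Gen3 vs Gen2: 1.91x
```

So the 2.2x figure is relative to USB3. Relative to the HAT at its default Gen2 setting, the gain from the config change alone is a bit under 2x.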

My SSD is partitioned into two halves: one is swap space, the other is a partition where I store my models (but they could also be stored anywhere else). Models that fit in RAM don't need the swap, of course.

I benchmarked all models with this command, testing prompt processing (pp512) and text generation (tg128) at zero context depth and, for almost all models, at 32k context depth:

$ llama.cpp/build/bin/llama-bench -r 2 --mmap 0 -d 0,32768 -m <all-models-as-GGUF> --progress | tee bench.txt
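For reference, the same invocation can be assembled from a list of model files, with each flag spelled out. This is a sketch only: the function name and the two GGUF paths are placeholders, and it relies on llama-bench accepting a repeated `-m` flag, which its README documents for all parameters. pp512 and tg128 are llama-bench's default tests, so they need no flags.

```python
import shlex
from pathlib import Path


def build_llama_bench_cmd(
    models: list[Path],
    bench_bin: str = "llama.cpp/build/bin/llama-bench",
) -> list[str]:
    """Build the llama-bench argument list used in the post (hypothetical helper)."""
    cmd = [
        bench_bin,
        "-r", "2",        # repeat every test twice
        "--mmap", "0",    # do not memory-map the model file
        "-d", "0,32768",  # test at context depth 0 and at 32768 tokens
        "--progress",     # print progress while running
    ]
    for model in models:
        cmd += ["-m", str(model)]  # one -m per GGUF file
    return cmd


# Placeholder paths, not the author's actual file names.
cmd = build_llama_bench_cmd([Path("models/a.gguf"), Path("models/b.gguf")])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` on a machine where llama.cpp has been built.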

Here are the filtered results in alphabetical order. All speeds are in tokens/sec. I adjusted some names, because GLM-4.7-Flash, for example, was listed under its underlying deepseek2 architecture:

model	size	pp512	pp512 @ d32768	tg128	tg128 @ d32768
Bonsai 8B Q1_0	1.07 GiB	3.27	-	2.77	-
gemma3 12B-it Q8_0	11.64 GiB	12.88	3.34	1.00	0.66
gemma4 E2B-it Q8_0	4.69 GiB	41.76	12.64	4.52	2.50
gemma4 E4B-it Q8_0	7.62 GiB	22.16	9.44	2.28	1.53
gemma4 26B-A4B-it Q8_0	25.00 GiB	9.22	5.03	2.45	1.44
GLM-4.7-Flash 30B.A3B Q8_0	29.65 GiB	6.59	0.90	1.64	0.11
gpt-oss 20B IQ4_XS	11.39 GiB	9.13	2.71	4.77	1.36
gpt-oss 20B Q8_0	20.72 GiB	4.80	2.19	2.70	1.13
gpt-oss 120B Q8_0	59.02 GiB	5.11	1.77	1.95	0.79
kimi-linear 48B.A3B IQ1_M	10.17 GiB	8.67	2.78	4.24	0.58
mistral3 14B Q4_K_M	7.67 GiB	5.83	1.27	1.49	0.42
Qwen3-Coder 30B.A3B Q8_0	30.25 GiB	10.79	1.42	2.28	0.47
Qwen3.5 0.8B Q8_0	763.78 MiB	127.70	28.43	11.51	5.52
Qwen3.5 2B Q8_0	1.86 GiB	75.92	24.50	5.57	3.62
Qwen3.5 4B Q8_0	4.16 GiB	31.02	9.44	2.42	1.51
Qwen3.5 9B Q8_0	8.86 GiB	18.20	7.62	1.36	1.01
Qwen3.5 27B Q2_K_M	9.42 GiB	1.38	-	0.92	-
Qwen3.5 35B.A3B Q8_0	34.36 GiB	10.58	5.14	2.25	1.30
Qwen3.5 122B.A10B Q2_K_M	41.51 GiB	2.46	1.57	1.05	0.59
Qwen3.5 122B.A10B Q8_0	120.94 GiB	2.65	1.23	0.38	0.27

build: 8c60b8a2b (8544) & b7ad48ebd (8661, needed for gemma4)
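One way to read the table is to ask how much text-generation speed each model keeps at 32k depth. A short sketch using four of the large A3B/A4B rows, with values copied from the table (the function name is made up):

```python
# (model, tg128 at depth 0, tg128 at depth 32768) in tokens/sec, from the table.
ROWS = [
    ("gemma4 26B-A4B-it Q8_0", 2.45, 1.44),
    ("GLM-4.7-Flash 30B.A3B Q8_0", 1.64, 0.11),
    ("Qwen3-Coder 30B.A3B Q8_0", 2.28, 0.47),
    ("Qwen3.5 35B.A3B Q8_0", 2.25, 1.30),
]


def retained_fraction(tg_empty: float, tg_deep: float) -> float:
    """Fraction of the empty-context generation speed left at 32k depth."""
    return tg_deep / tg_empty


retention = {name: retained_fraction(empty, deep) for name, empty, deep in ROWS}

# Print from best to worst retention.
for name, frac in sorted(retention.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} keeps {frac:.0%} of its tg128 speed at 32k")
```

On these numbers, gemma4 26B-A4B and Qwen3.5 35B.A3B keep a bit under 60% of their speed, while Qwen3-Coder drops to about a fifth and GLM-4.7-Flash to well under a tenth.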

I'll put the full llama-bench output into the comments for completeness' sake.

The list includes Bonsai8B, for which I compiled its llama.cpp fork and tested with that. Maybe I did something wrong, maybe the calculations aren't really optimized for ARM CPUs, I don't know. I'm not interested in looking into that model any further, but I was asked to include it.

A few observations and remarks:

CPU temperature was around 75°C for small models that fit entirely in RAM
CPU temperature was around 65°C for swapped models like Qwen3.5-35B.A3B.Q8_0, with load jumping between 50% and 100%
--> That's +5°C (RAM) and +15°C (swapped) compared to the earlier tests without the HAT, because of the now more restricted airflow and the higher CPU load
Another non-surprise: The more active parameters, the slower it gets, with dense models really suffering in speed (like Qwen3.5 27B).
I tried to compile ik_llama, but it failed because of code errors, so I couldn't test it and haven't had the time yet to make it work.
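The remark about active parameters can be sanity-checked with a rough calculation. The sketch below assumes that text generation is limited by how fast weights can be read, so that each generated token costs roughly one pass over the whole file for a dense model. That assumption is not measured in the post. The rows are the Q8_0 models without an "A…B" suffix that fit in 16GB RAM, which are taken to be dense here.

```python
# (model, file size in GiB, tg128 in tokens/sec), copied from the table.
DENSE_IN_RAM = [
    ("gemma3 12B-it Q8_0", 11.64, 1.00),
    ("Qwen3.5 9B Q8_0", 8.86, 1.36),
    ("Qwen3.5 4B Q8_0", 4.16, 2.42),
    ("Qwen3.5 2B Q8_0", 1.86, 5.57),
]


def implied_read_rate(size_gib: float, tokens_per_sec: float) -> float:
    """GiB of weights read per second if every token touches the whole file."""
    return size_gib * tokens_per_sec


rates = {name: implied_read_rate(size, tg) for name, size, tg in DENSE_IN_RAM}

for name, rate in rates.items():
    print(f"{name:20s} ~{rate:.1f} GiB/s")
```

All four land between roughly 10 and 12 GiB/s, which fits the idea that speed is set by the amount of weight data read per token. That would also explain why models with only a few billion active parameters generate faster than dense models of similar file size.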

Take from my tests what you need. I'm happy to have this little potato and to experiment with it. Other models can be tested if there's demand.

If you have any questions, just comment or message me. :)

Edit 2026-04-05: Added 32k-results for gpt-oss 120b

submitted by /u/honuvo