benchmarks of gemma4 and multiple others on Raspberry Pi5

Reddit r/LocalLLaMA / 4/6/2026

💬 OpinionDeveloper Stack & InfrastructureSignals & Early TrendsTools & Practical UsageModels & Research

Key Points

  • The author benchmarks Gemma4 and several other LLMs on a Raspberry Pi 5, comparing inference/text-generation performance when storage is attached via USB versus the official M.2 HAT (PCIe).
  • Switching from USB3 to PCIe increases SSD read throughput by about 2.2x (roughly doubling read speed), which translates into an estimated 1.5x–2x improvement in tokens/sec when models are served with swap.
  • The test setup uses a stock Raspberry Pi OS Lite (Trixie), an official active cooler, and a 1TB SSD with half swap and half model storage, while running different prompt-processing (pp512) and text-generation (tg128) workloads.
  • The PCIe performance gain is achieved by adjusting the Pi’s PCIe generation setting (dtparam=pciex1_gen=3), raising SSD read rates close to the maximum reported by others using the same HAT.
  • Benchmarks are run with llama.cpp’s llama-bench across model sizes and context lengths (including near-32k contexts) to show practical expectations with minimal hardware tinkering.
benchmarks of gemma4 and multiple others on Raspberry Pi5

Hey all,

this is an update! A few days ago I posted to show the performance of a Raspberry Pi5 when using a SSD to let larger models run. Rightfully so, a few brought to my attention that the PCIe is faster than the USB3 connection I was using, so I bought the official HAT.

Spoiler: As expected: Read speed doubled, leading to 1.5x to 2x improvement on tokens/sec for inference and text generation on models in swap.

I'll repeat my setup shortly:

  • Raspberry Pi5 with 16GB RAM
  • Official Active Cooler
  • Official M.2 HAT+ Standard
  • 1TB SSD connected via HAT
  • Running stock Raspberry Pi OS lite (Trixie)

My focus is on the question: What performance can I expect when buying a few standard components with only a little bit of tinkering? I know I can buy larger fans/coolers from third-party sellers, overclock and overvolt, buy more niche devices like an Orange Pi, but thats not what I wanted, so I went with a standard Pi and kept tinkering to a minimum, so that most can still do the same.

By default the Pi uses the PCIe interface with the Gen2 standard (so I only got ~418MB/sec read speed from the SSD when using the HAT). I appended dtparam=pciex1_gen=3 to the file "/boot/firmware/config.txt" and rebooted to use Gen3.

Read speed of the SSD increased from 360.18MB/sec (USB) by a factor of 2.2x to what seems to be the maximum others achieved too with the HAT.

$ sudo hdparm -t --direct /dev/nvme0n1p2 /dev/nvme0n1p2: Timing O_DIRECT disk reads: 2398 MB in 3.00 seconds = 798.72 MB/sec 

My SSD is partitioned to be half swapspace, half partition where I store my models (but that could be also anywhere else). Models that fit in RAM don't need the swap of course.

I benchmarked all models with this command, testing prompt processing (pp512) and text generation (tg128) at zero and (almost all) at 32k context:

$ llama.cpp/build/bin/llama-bench -r 2 --mmap 0 -d 0,32768 -m <all-models-as-GGUF> --progress | tee bench.txt 

Here are the filtered results in alphabetical order (names adjusted as GLM4.7-Flash was mentioned as the underlying deepseek2 architecture for example):

model size pp512 pp512 @ d32768 tg128 tg128 @ d32768
Bonsai 8B Q1_0 1.07 GiB 3.27 - 2.77 -
gemma3 12B-it Q8_0 11.64 GiB 12.88 3.34 1.00 0.66
gemma4 E2B-it Q8_0 4.69 GiB 41.76 12.64 4.52 2.50
gemma4 E4B-it Q8_0 7.62 GiB 22.16 9.44 2.28 1.53
gemma4 26B-A4B-it Q8_0 25.00 GiB 9.22 5.03 2.45 1.44
GLM-4.7-Flash 30B.A3B Q8_0 29.65 GiB 6.59 0.90 1.64 0.11
gpt-oss 20B IQ4_XS 11.39 GiB 9.13 2.71 4.77 1.36
gpt-oss 20B Q8_0 20.72 GiB 4.80 2.19 2.70 1.13
gpt-oss 120B Q8_0 59.02 GiB 5.11 1.77 1.95 0.79
kimi-linear 48B.A3B IQ1_M 10.17 GiB 8.67 2.78 4.24 0.58
mistral3 14B Q4_K_M 7.67 GiB 5.83 1.27 1.49 0.42
Qwen3-Coder 30B.A3B Q8_0 30.25 GiB 10.79 1.42 2.28 0.47
Qwen3.5 0.8B Q8_0 763.78 MiB 127.70 28.43 11.51 5.52
Qwen3.5 2B Q8_0 1.86 GiB 75.92 24.50 5.57 3.62
Qwen3.5 4B Q8_0 4.16 GiB 31.02 9.44 2.42 1.51
Qwen3.5 9B Q8_0 8.86 GiB 18.20 7.62 1.36 1.01
Qwen3.5 27B Q2_K_M 9.42 GiB 1.38 - 0.92 -
Qwen3.5 35B.A3B Q8_0 34.36 GiB 10.58 5.14 2.25 1.30
Qwen3.5 122B.A10B Q2_K_M 41.51 GiB 2.46 1.57 1.05 0.59
Qwen3.5 122B.A10B Q8_0 120.94 GiB 2.65 1.23 0.38 0.27

build: 8c60b8a2b (8544) & b7ad48ebd (8661 because of gemma4 )

I'll put the full llama-bench output into the comments for completeness sake.

The list includes Bonsai8B, for which I compiled the llama.cpp-fork and tested with that. Maybe I did something wrong, maybe the calculations aren't really optimized for ARM CPUs, I don't know. Not interested in looking into that model more, but I got asked to include.

A few observations and remarks:

  • CPU temperature was around ~75°C for small models that fit entirely in RAM
  • CPU temperature was around ~65°C for swapped models like Qwen3.5-35B.A3B.Q8_0 with load jumping between 50-100%
  • --> Thats +5 (RAM) and +15°C (swapped) in comparison to the earlier tests without the HAT, because of the now more restricted airflow and the higher CPU load
  • Another non-surprise: The more active parameters, the slower it gets, with dense models really suffering in speed (like Qwen3.5 27B).
  • I tried to compile ik_llama but failed because of code errors, so I couldn't test that and didn't have the time yet to make it work.

Take from my tests what you need. I'm happy to have this little potato and to experiment with it. Other models can be tested if there's demand.

If you have any questions just comment or write me. :)

Edit 2026-04-05: Added 32k-results for gpt-oss 120b

submitted by /u/honuvo
[link] [comments]