AI Navigate

Update on Qwen 3.5 35B A3B on Raspberry Pi 5

Reddit r/LocalLLaMA / 3/12/2026

📰 News

Key Points

  • The author demonstrates running Qwen 3.5 35B A3B on Raspberry Pi 5 using a modified llama.cpp workflow (combining the OG repo with ik_llama tweaks) and prompt caching.

Update on Qwen 3.5 35B A3B on Raspberry Pi 5

Did some more work on my Raspberry Pi inference setup.

  1. Modified llama.cpp (a mix of the OG repo, ik_llama, and some tweaks; a rough stock-build sketch follows this list)
  2. Experimented with different quants, params, etc.
  3. Prompt caching (ik_llama has some issues on ARM, so it’s not 100% tweaked yet, but I’m getting there; see the run example after the quant link below)
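
For orientation, the build here is a custom mix of repos, so there is no single command to point at, but a plain upstream llama.cpp CPU build on the Pi looks roughly like this (repo URL and cmake options below are stock upstream, not the modified tree used in the demo):

```bash
# Rough sketch: stock upstream llama.cpp CPU build on a Pi 5 (aarch64).
# This is the baseline the modified tree starts from, not the modified tree itself.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_NATIVE=ON            # let ggml detect and use the Pi's native ARM features
cmake --build build --config Release -j4   # Pi 5 has 4 cores
```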

The demo above is running this specific quant: https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/blob/main/Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf
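
Not the exact command from the demo, but a minimal sketch of running that quant with stock llama-cli flags (the modified build may expose more options; the filename comes from the link above, 16k context matches the numbers below, and this is text-only, without the vision encoder):

```bash
# Minimal sketch with stock llama-cli flags, not the exact demo command.
# 4 threads for the Pi 5's 4 cores; --prompt-cache saves the processed prompt
# state to disk so repeated runs with the same prefix skip most of the slow
# prompt processing.
./build/bin/llama-cli \
  -m Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf \
  -c 16384 \
  -t 4 \
  --prompt-cache qwen-prefix.bin \
  --prompt-cache-all \
  -p "You are a helpful assistant running on a Raspberry Pi 5." \
  -n 128
```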

Some numbers for what to expect now (all tests on 16k context, vision encoder enabled):

  1. 2-bit big-ish quants of Qwen3.5 35B A3B: 3.5 t/s on the 16GB Pi, 2.5-ish t/s on the SSD-enabled 8GB Pi. Prompt processing is around 50s per 1k tokens.
  2. Smaller 2-bit quants: up to 4.5 t/s, around 3-ish t/s on the SSD 8GB one
  3. Qwen3.5 2B 4-bit: 8 t/s on both, which is pretty impressive actually
  4. Qwen3.5 4B runs similarly to A3B

Let me know what you guys think. Also, if anyone has a Pi 5 and wants to try it and poke around, lemme know. I have some other tweaks I'm actively testing (for example asymmetric KV cache quantisation, which is giving some really good boosts in prompt processing).
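
For context on that last bit: upstream llama.cpp already lets you pick different quant types for the K and V caches, which is one way to get an asymmetric setup. The flags below are just an illustration of that stock feature, not the tweaks being tested here:

```bash
# Illustration of asymmetric KV cache quantisation with stock llama.cpp flags:
# keep K at higher precision (q8_0) than V (q4_0). A quantised V cache needs
# flash attention; older builds take plain -fa, newer ones may expect "-fa on".
./build/bin/llama-cli \
  -m Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf \
  -c 16384 \
  -t 4 \
  -fa \
  --cache-type-k q8_0 \
  --cache-type-v q4_0 \
  -p "Hello from the Pi" \
  -n 64
```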

submitted by /u/jslominski