I tested 7 local models on 22 real agent tasks using OpenClaw on a Raspberry Pi 5 with an RTX 3090 running Ollama.
Tasks included reading emails, scheduling meetings, creating tasks, detecting phishing, handling errors, and browser automation.
The winner by a massive margin: qwen3.5:27b-q4_K_M at 59.4%. The runner up (qwen3.5:35b) scored only 23.2%. Everything else was below 5%.
Biggest surprises:
The quantized 27B model beat the larger 35B version by 2.5x. A 30B model scored dead last at 1.6%. Medium thinking worked best. Too much thinking actually hurt performance. Zero models could complete browser automation. The main thing that separated winners from losers was whether the model could find and use command line tools.
[link] [comments]
