What would you say is the minimum number of tokens per second you would tolerate for your local agent workflows?
I have been trying pi.dev connected to a llama.cpp instance running Qwen3.6-27B-Q6_K_L with 200K context on an RTX A6000. I get about 26 t/s and it's surprisingly usable. About the same user experience I get with Claude Code connected to Anthropic. But I have only been fooling around with relatively simple prompts so far. I'm also trying out the Brave Search API.
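If anyone wants to sanity-check their own t/s numbers, here's a rough sketch that times one completion against a local llama.cpp server's OpenAI-compatible endpoint. The URL, port, prompt, and server flags in the comments are assumptions, not my exact setup:

```python
# Hedged sketch: estimate generation throughput from one request to a
# local llama.cpp server. Assumes the server was started with something
# like `llama-server -m model.gguf -c 200000` on the default port 8080.
import json
import time
import urllib.request

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens/second for a completed generation."""
    return n_tokens / elapsed_s

def benchmark(base_url: str = "http://localhost:8080") -> float:
    """Time one non-streaming chat completion and return tokens/second."""
    payload = json.dumps({
        "messages": [{"role": "user", "content": "Explain mmap in one paragraph."}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.monotonic() - start
    # The server reports how many tokens it generated in the usage field.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)

if __name__ == "__main__":
    print(f"{benchmark():.1f} t/s")
```

Note the elapsed time includes prompt processing, so this reads a bit low compared to pure decode speed; it's closer to what you actually feel in an agent loop.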