llama-server (llama.cpp) may by default allocate up to 4x your context size in order to serve multiple clients. If you're a single user on a system with little VRAM, you know the chain: bigger context -> less room for the model in VRAM -> reduced speed.
So launch with llama-server -np 1, and maybe add --fit-target 126.
On my 12GB GPU with 60k context I got ~20% more TPS.
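A rough sketch of why the single-slot flag matters: the KV cache grows linearly with total context, and extra parallel slots multiply that total. The layer/head numbers below are illustrative placeholders, not any particular model's actual config:

```shell
# Illustrative f16 KV-cache math (placeholder model shape, not a real config):
layers=36; kv_heads=8; head_dim=128; ctx=60000
# bytes per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (f16)
per_tok=$((2 * layers * kv_heads * head_dim * 2))
one_slot=$((per_tok * ctx / 1024 / 1024))   # MiB with -np 1
four_slot=$((one_slot * 4))                 # MiB with 4 parallel slots
echo "1 slot:  ${one_slot} MiB"
echo "4 slots: ${four_slot} MiB"
```

On a 12GB card, that multiplied allocation is exactly the VRAM you'd rather spend on model layers.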
One more tip: if you use Firefox (or other browsers), disable hardware acceleration:
- Go to Settings > General > Performance.
- Uncheck "Use recommended performance settings".
- Uncheck "Use hardware acceleration when available".
- Restart Firefox.
Firefox reserves chunks of your VRAM for rendering web pages; you probably want every resource you have available for your local LLM serving.
Damn, now I'm serving Qwen3.5-35B-A3B-IQ2_S at 90.94 tokens per second on a 6700 XT, up from the original 66 t/s.
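For what it's worth, the numbers reported above work out to roughly a 38% gain:

```shell
# Speedup from 66 t/s to 90.94 t/s, as a percentage:
gain=$(awk 'BEGIN { printf "%.0f", (90.94 / 66 - 1) * 100 }')
echo "${gain}% faster"
```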