Tips: remember to use -np 1 with llama-server as a single user

Reddit r/LocalLLaMA / 3/27/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The post explains that llama-server’s default behavior may allocate about 4× the context size to support multiple clients, which can hurt performance on low-VRAM systems.
  • It recommends running llama-server with the flag `-np 1` for single-user setups, optionally using `--fit-target 126` to better fit the model to available memory.
  • The author reports performance gains on a 12GB GPU (e.g., ~20% more TPS) after changing these launch parameters, attributing improvements to reduced VRAM overhead.
  • It also advises disabling browser hardware acceleration in Firefox to free VRAM from reserved chunks, potentially improving throughput for local LLM serving.
  • A final anecdote notes improved serving performance for a Qwen3.5-35B variant, reaching ~90.94 tokens/sec versus ~66 tokens/sec originally on a 6700XT.

By default, llama-server (llama.cpp) may allocate roughly 4× the context size so it can serve multiple clients in parallel. If you are a single user on a system with little VRAM, you know the trade-off: a bigger context allocation means less of the model fits in VRAM, which means reduced speed.
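
To see why context size dominates on a small GPU, here's a rough back-of-the-envelope for FP16 KV-cache memory. The layer/head counts below are illustrative assumptions, not the numbers for any particular model, and this is a simplified sizing formula rather than llama.cpp's exact accounting:

```shell
# KV-cache bytes ≈ 2 (K and V) * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
# Assumed example values -- adjust to your model's config:
n_layers=48; n_ctx=61440; n_kv_heads=8; head_dim=128; bytes_per_elem=2
kv_bytes=$((2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem))
echo "$((kv_bytes / 1024 / 1024)) MiB"   # ~11520 MiB for a 60k context at FP16
```

With numbers like these, multiplying the context allocation by 4 for parallel slots can easily evict model layers from a 12GB card.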

So launch with `llama-server -np 1`, and maybe add `--fit-target 126`.
On my 12GB GPU with 60k context I got ~20% more TPS.
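
Put together, a single-user launch might look like this. The model path and context size are placeholder examples, and `--fit-target 126` is the value suggested in the post:

```shell
# -np 1 gives the single client the entire context instead of splitting it
# across parallel server slots.
llama-server \
  -m ./models/Qwen3.5-35B-A3B-IQ2_S.gguf \
  -c 61440 \
  -np 1 \
  --fit-target 126
```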

One more: if you use Firefox (or others) disable hw acceleration:

  • Go to Settings > General > Performance.
  • Uncheck "Use recommended performance settings".
  • Uncheck "Use hardware acceleration when available".
  • Restart Firefox.

Firefox uses and reserves chunks of your VRAM for rendering web pages; you may want all the resources you have for your local LLM serving.
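
You can check how much VRAM actually gets reclaimed by comparing usage before and after restarting the browser. Which tool applies depends on your GPU vendor (these are the standard vendor utilities, not anything specific to this post):

```shell
# NVIDIA: shows per-process VRAM usage, so you can see the browser's share
nvidia-smi

# AMD (e.g. a 6700XT): shows total VRAM in use
rocm-smi --showmeminfo vram
```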

Damn, now I'm serving Qwen3.5-35B-A3B-IQ2_S
at 90.94 tokens per second on a 6700XT, up from the original 66 t/s.

submitted by /u/ea_man