llama-server (llama.cpp) may by default allocate up to 4x your context size in order to serve multiple clients. If you're a single user on a system with little VRAM, you know the chain: bigger context -> less room for the model in VRAM -> reduced speed.
So launch with llama-server -np 1, and maybe add --fit-target 126.
On my 12GB GPU with 60k context I got ~20% more TPS.
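A rough sketch of why the single-slot flag matters: the KV cache grows linearly with total context, and extra parallel slots multiply that total. The layer/head numbers below are illustrative placeholders, not any particular model's actual config:

```shell
# Illustrative f16 KV-cache math (placeholder model shape, not a real config):
layers=36; kv_heads=8; head_dim=128; ctx=60000
# bytes per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (f16)
per_tok=$((2 * layers * kv_heads * head_dim * 2))
one_slot=$((per_tok * ctx / 1024 / 1024))   # MiB with -np 1
four_slot=$((one_slot * 4))                 # MiB with 4 parallel slots
echo "1 slot:  ${one_slot} MiB"
echo "4 slots: ${four_slot} MiB"
```

On a 12GB card, that multiplied allocation is exactly the VRAM you'd rather spend on model layers.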
One more tip: if you use Firefox (or other browsers), disable hardware acceleration:
- Go to Settings > General > Performance.
- Uncheck "Use recommended performance settings".
- Uncheck "Use hardware acceleration when available".
- Restart Firefox.
Firefox reserves chunks of your VRAM for rendering web pages; you probably want every resource you have available for your local LLM serving.
Damn, now I'm serving Qwen3.5-35B-A3B-IQ2_S at 90.94 tokens per second on a 6700 XT, up from the original 66 t/s.
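For what it's worth, the numbers reported above work out to roughly a 38% gain:

```shell
# Speedup from 66 t/s to 90.94 t/s, as a percentage:
gain=$(awk 'BEGIN { printf "%.0f", (90.94 / 66 - 1) * 100 }')
echo "${gain}% faster"
```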