Painfully slow local llama on 5090 and 192GB RAM

Reddit r/LocalLLaMA / 3/30/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A user running a local Llama server (MiniMax-M2.5-UD-Q3_K_XL.gguf) on a high-memory setup reports extremely slow generation, with ~5–10 tokens per second even after the initial prompt.
  • Their current launch configuration includes very large context size (ctx-size 65536), high batch/ub settings, and n-gpu-layers set to -1 (intended to use all available GPU layers).
  • They seek guidance on an “optimal setup” to improve throughput and reduce latency for local inference.
  • An update clarifies that the initial Ollama command was incorrect, and they corrected integration by setting ANTHROPIC_BASE_URL to the local Llama server and launching Claude normally to point to that backend.
  • The thread implicitly focuses on practical performance tuning for local LLM serving, particularly around context length, batching, and server/API integration.
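Before tuning anything, the first question the thread's advice implies is whether the model's layers actually landed on the GPU. llama.cpp prints an offload summary when the model loads, so a quick look at the server log answers it (the exact log wording varies by build; this is a diagnostic sketch, not from the post):

```shell
# Inspect the server log for the loader's offload summary; llama.cpp
# prints how many layers went to the GPU (exact wording varies by build),
# e.g. a line like "offloaded 40/63 layers to GPU".
grep -i "offloaded" llama-server.log
```

If fewer layers were offloaded than expected, generation falls back to CPU for the remainder, which alone can explain single-digit tokens per second.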

I am running a llama server with the following command:
nohup ./llama-server \
--model "/path/to/your/models/MiniMax-M2.5-UD-Q3_K_XL.gguf" \
--alias "minimax_m2.5" \
--threads $(nproc) \
--threads-batch $(nproc) \
--n-gpu-layers -1 \
--port 8001 \
--ctx-size 65536 \
-b 4096 -ub 4096 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
> llama-server.log 2>&1 &
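For a quant this large on a single 32 GB card, the usual suspects are VRAM pressure from the 65536-token KV cache and the placement of the MoE expert tensors. A hedged sketch of flags people commonly try with recent llama.cpp builds (the flag names exist in current llama.cpp; the specific values here are guesses and would need tuning per machine):

```shell
# Sketch only: flag names from recent llama.cpp builds; the values are
# guesses for a 32 GB GPU with large system RAM, not a verified config.
./llama-server \
  --model "/path/to/your/models/MiniMax-M2.5-UD-Q3_K_XL.gguf" \
  --n-gpu-layers 999 \
  --n-cpu-moe 30 \
  --ctx-size 16384 \
  --port 8001
# --n-gpu-layers 999: request every layer; llama.cpp clamps to what fits
# --n-cpu-moe 30:     keep MoE expert tensors for 30 layers in system RAM,
#                     freeing VRAM for attention/shared weights
# --ctx-size 16384:   a smaller KV cache leaves more VRAM for model layers
```

The idea behind `--n-cpu-moe` is that for a mixture-of-experts model the expert FFN weights dominate the footprint, so parking some of them in the 192 GB of system RAM can let everything else stay on the GPU.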
----------

and then
ollama launch claude --model frob/minimax-m2.5

----------
I wait more than 10 minutes for the first answer when I give it the first prompt, and subsequent prompts remain similarly slow.
Tokens per second is around 5–10.
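For comparing before/after tuning, it helps to compute tokens per second from the token count and elapsed time that llama-server reports in its timings (e.g. the predicted-token count and milliseconds pair). A tiny helper, assuming those two numbers are at hand:

```shell
# Compute generation speed from a token count and elapsed milliseconds,
# e.g. the predicted-token count / predicted-ms pair llama-server reports.
toks_per_sec() {
  awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", n / (ms / 1000) }'
}

toks_per_sec 512 60000   # 512 tokens in 60 s -> 8.5 tok/s
```

Single-digit results like the 8.5 tok/s above are in line with the speeds reported in the post.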

Any guide to an optimal setup would be appreciated!

UPDATE: my bad on the ollama thing, that's not what I am running. I set the Anthropic base URL and launch Claude normally so it points at the llama server. This follows a guide from the Unsloth docs:
export ANTHROPIC_BASE_URL="http://localhost:8001"
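For completeness, guides that wire Claude Code to a local Anthropic-compatible server typically also set a placeholder API token, since the CLI expects one to be present (the `ANTHROPIC_AUTH_TOKEN` variable and the dummy value are assumptions based on such guides, not from the post):

```shell
# Point Claude Code at the local llama-server and satisfy its
# API-key check with a placeholder token (local server ignores it).
export ANTHROPIC_BASE_URL="http://localhost:8001"
export ANTHROPIC_AUTH_TOKEN="dummy"
claude
```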

submitted by /u/RVxAgUn