Painfully slow local llama on 5090 and 192GB RAM

Reddit r/LocalLLaMA / 3/30/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A user running a local Llama server (MiniMax-M2.5-UD-Q3_K_XL.gguf) on a high-memory setup reports extremely slow generation, with ~5–10 tokens per second even after the initial prompt.
  • Their current launch configuration includes very large context size (ctx-size 65536), high batch/ub settings, and n-gpu-layers set to -1 (intended to use all available GPU layers).
  • They seek guidance on an “optimal setup” to improve throughput and reduce latency for local inference.
  • An update clarifies that the initial Ollama command was incorrect, and they corrected integration by setting ANTHROPIC_BASE_URL to the local Llama server and launching Claude normally to point to that backend.
  • The thread implicitly focuses on practical performance tuning for local LLM serving, particularly around context length, batching, and server/API integration.
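Before tuning anything, the first question the thread's advice implies is whether the model's layers actually landed on the GPU. llama.cpp prints an offload summary when the model loads, so a quick look at the server log answers it (the exact log wording varies by build; this is a diagnostic sketch, not from the post):

```shell
# Inspect the server log for the loader's offload summary; llama.cpp
# prints how many layers went to the GPU (exact wording varies by build),
# e.g. a line like "offloaded 40/63 layers to GPU".
grep -i "offloaded" llama-server.log
```

If fewer layers were offloaded than expected, generation falls back to CPU for the remainder, which alone can explain single-digit tokens per second.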

I am running a llama server with the following command:
nohup ./llama-server \
--model "/path/to/your/models/MiniMax-M2.5-UD-Q3_K_XL.gguf" \
--alias "minimax_m2.5" \
--threads $(nproc) \
--threads-batch $(nproc) \
--n-gpu-layers -1 \
--port 8001 \
--ctx-size 65536 \
-b 4096 -ub 4096 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
> llama-server.log 2>&1 &
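For a quant this large on a single 32 GB card, the usual suspects are VRAM pressure from the 65536-token KV cache and the placement of the MoE expert tensors. A hedged sketch of flags people commonly try with recent llama.cpp builds (the flag names exist in current llama.cpp; the specific values here are guesses and would need tuning per machine):

```shell
# Sketch only: flag names from recent llama.cpp builds; the values are
# guesses for a 32 GB GPU with large system RAM, not a verified config.
./llama-server \
  --model "/path/to/your/models/MiniMax-M2.5-UD-Q3_K_XL.gguf" \
  --n-gpu-layers 999 \
  --n-cpu-moe 30 \
  --ctx-size 16384 \
  --port 8001
# --n-gpu-layers 999: request every layer; llama.cpp clamps to what fits
# --n-cpu-moe 30:     keep MoE expert tensors for 30 layers in system RAM,
#                     freeing VRAM for attention/shared weights
# --ctx-size 16384:   a smaller KV cache leaves more VRAM for model layers
```

The idea behind `--n-cpu-moe` is that for a mixture-of-experts model the expert FFN weights dominate the footprint, so parking some of them in the 192 GB of system RAM can let everything else stay on the GPU.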
----------

and then
ollama launch claude --model frob/minimax-m2.5

----------
I wait more than 10 minutes for the first answer when I give it the first prompt, and subsequent prompts remain similarly slow.
Tokens per second is around 5–10.
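For comparing before/after tuning, it helps to compute tokens per second from the token count and elapsed time that llama-server reports in its timings (e.g. the predicted-token count and milliseconds pair). A tiny helper, assuming those two numbers are at hand:

```shell
# Compute generation speed from a token count and elapsed milliseconds,
# e.g. the predicted-token count / predicted-ms pair llama-server reports.
toks_per_sec() {
  awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", n / (ms / 1000) }'
}

toks_per_sec 512 60000   # 512 tokens in 60 s -> 8.5 tok/s
```

Single-digit results like the 8.5 tok/s above are in line with the speeds reported in the post.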

Any guide to an optimal setup would be appreciated!

UPDATE: my bad on the ollama thing, that's not what I am running. I set the Anthropic base URL and launch Claude normally so it points at the llama server. This follows a guide from the Unsloth docs:
export ANTHROPIC_BASE_URL="http://localhost:8001"
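For completeness, guides that wire Claude Code to a local Anthropic-compatible server typically also set a placeholder API token, since the CLI expects one to be present (the `ANTHROPIC_AUTH_TOKEN` variable and the dummy value are assumptions based on such guides, not from the post):

```shell
# Point Claude Code at the local llama-server and satisfy its
# API-key check with a placeholder token (local server ignores it).
export ANTHROPIC_BASE_URL="http://localhost:8001"
export ANTHROPIC_AUTH_TOKEN="dummy"
claude
```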

submitted by /u/RVxAgUn