I am running llama-server with the following command:
nohup ./llama-server \
--model "/path/to/your/models/MiniMax-M2.5-UD-Q3_K_XL.gguf" \
--alias "minimax_m2.5" \
--threads $(nproc) \
--threads-batch $(nproc) \
--n-gpu-layers -1 \
--port 8001 \
--ctx-size 65536 \
-b 4096 -ub 4096 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
> llama-server.log 2>&1 &
----------
and then
ollama launch claude --model frob/minimax-m2.5
----------
I wait more than 10 minutes for the first answer to come back after I give it the first prompt, and subsequent prompts remain similarly slow.
Generation speed is around 5-10 tokens per second.
Any guide to an optimal setup would be appreciated!
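To make the numbers above easier to compare, it may help to report prompt processing and generation speed separately, since a long wait before the first answer and a low tokens-per-second figure can have different causes. Below is a minimal sketch of the arithmetic only. The helper name tok_per_sec is hypothetical, and the example values are made up; the real token counts and timings would have to be read from llama-server.log, assuming the server prints them there.

```shell
# Hypothetical helper (not part of llama.cpp): tokens per second
# from a token count and an elapsed time in milliseconds.
tok_per_sec() {
  # $1 = number of tokens, $2 = elapsed time in milliseconds
  awk -v n="$1" -v ms="$2" 'BEGIN {
    if (ms <= 0) { print "n/a"; exit }   # avoid division by zero
    printf "%.2f\n", n / (ms / 1000)     # tokens / seconds
  }'
}

# Example with made-up numbers: 100 generated tokens in 20000 ms
tok_per_sec 100 20000
```

Running it once with the prompt token count and prompt time, and once with the generated token count and generation time, gives two separate rates to post.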
UPDATE: My bad on the ollama thing, that's not what I am running. I set the Anthropic base URL and launch claude normally so that it points to llama-server. This is from the guide in the Unsloth docs:
export ANTHROPIC_BASE_URL="http://localhost:8001"