TL;DR: it's slow as heck. Run it overnight.
I asked it a question about codebase architecture.
End to end, a 48k-token prompt plus ~4k thinking tokens took about 2 hours.
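That 2-hour figure falls straight out of the server log below; here's the quick sanity check, using only the numbers llama-server printed:

```sh
# Prompt processing: 4955501.32 ms over 48349 tokens
echo "4955501.32 / 1000 / 60" | bc -l   # ≈ 82.6 minutes
# Generation: 2652689.61 ms over 5583 tokens
echo "2652689.61 / 1000 / 60" | bc -l   # ≈ 44.2 minutes
# Total ≈ 126.8 minutes, i.e. ~2.1 hours end to end
```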
```sh
llama-server -hf unsloth/Mistral-Medium-3.5-128B-GGUF:UD-Q5_K_XL --temp 0.7 --host 0.0.0.0 --port 8080 \
  -c 80000 -fa on -ngl 999 --no-context-shift -fit off --no-mmap -np 1 --mlock --cache-reuse 256 \
  --chat-template-kwargs '{"reasoning_effort":"high"}' --no-mmproj
```

```
May 03 13:27:09 llama-server[6051]: prompt eval time = 4955501.32 ms / 48349 tokens (102.49 ms per token, 9.76 tokens per second)
May 03 13:27:09 llama-server[6051]:        eval time = 2652689.61 ms /  5583 tokens (475.14 ms per token, 2.10 tokens per second)
```
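If you want to poke at the same setup, llama-server exposes an OpenAI-compatible chat endpoint, so a request looks roughly like this (the message content is a placeholder standing in for my actual 48k-token question):

```sh
# Minimal sketch of a chat request against the server started above.
# The prompt body here is a stand-in, not the real question.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Describe the architecture of this codebase: ..."}
        ],
        "temperature": 0.7
      }'
```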




