Mistral Medium 3.5 on AMD Strix Halo

Reddit r/LocalLLaMA / 5/4/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A Reddit user reports that Mistral Medium 3.5 running on AMD Strix Halo is extremely slow, recommending that such workloads be left to run overnight.
  • In a test with an end-to-end prompt of 48k tokens plus 4k thinking tokens, the model reportedly took about 2 hours to complete.
  • The included llama-server logs show very long prompt-evaluation and generation times, resulting in low throughput: roughly 9.8 tokens/sec during prompt evaluation and about 2.1 tokens/sec during generation (see the arithmetic check after this list).
  • The user shared the exact llama-server invocation (including context length and GGUF model parameters) used for the benchmark.
  • Overall, the post highlights practical performance limitations for running this model locally on the stated AMD setup under heavy token budgets.
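The throughput and wall-time figures follow directly from the timings in the quoted log. A minimal arithmetic check in Python, using only the two figures copied from the post's llama-server output:

```python
# Timing figures copied verbatim from the llama-server log quoted below.
prompt_ms, prompt_tokens = 4_955_501.32, 48_349  # "prompt eval time" line
gen_ms, gen_tokens = 2_652_689.61, 5_583         # "eval time" (generation) line

# Throughput: tokens divided by elapsed seconds.
print(f"prompt eval: {prompt_tokens / (prompt_ms / 1000):.2f} tok/s")  # ~9.76
print(f"generation:  {gen_tokens / (gen_ms / 1000):.2f} tok/s")        # ~2.10

# Total wall-clock time, confirming the "about 2 hours" figure.
print(f"total: {(prompt_ms + gen_ms) / 3_600_000:.2f} h")              # ~2.11
```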

TL;DR: it's slow as heck. Run overnight.

I asked it a question about codebase architecture.

For an end-to-end prompt of 48k tokens + 4k thinking tokens, it took about 2 hours.

llama-server -hf unsloth/Mistral-Medium-3.5-128B-GGUF:UD-Q5_K_XL --temp 0.7 --host 0.0.0.0 --port 8080 -c 80000 -fa on -ngl 999 --no-context-shift -fit off --no-mmap -np 1 --mlock --cache-reuse 256 --chat-template-kwargs '{"reasoning_effort":"high"}' --no-mmproj

May 03 13:27:09 llama-server[6051]: prompt eval time = 4955501.32 ms / 48349 tokens (102.49 ms per token, 9.76 tokens per second)
May 03 13:27:09 llama-server[6051]: eval time = 2652689.61 ms / 5583 tokens (475.14 ms per token, 2.10 tokens per second)
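For anyone wanting to reproduce this kind of measurement against their own instance, here is a minimal sketch, assuming llama-server's OpenAI-compatible /v1/chat/completions endpoint and the host/port from the invocation above; the prompt file name is a placeholder:

```python
import time
import requests  # third-party: pip install requests

# Placeholder: any long prompt you want to benchmark.
with open("long_prompt.txt") as f:
    prompt = f.read()

start = time.time()
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # port matches the command above
    json={"messages": [{"role": "user", "content": prompt}]},
    timeout=None,  # a run like the one reported here can take hours
)
elapsed = time.time() - start

# llama-server reports token counts in the OpenAI-style "usage" field.
usage = resp.json().get("usage", {})
print(f"wall time: {elapsed / 3600:.2f} h")
print(f"prompt tokens: {usage.get('prompt_tokens')}, "
      f"completion tokens: {usage.get('completion_tokens')}")
```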
submitted by /u/Zc5Gwu