
I checked Strix Halo (Ryzen AI Max+ 395) performance as context length increases

Reddit r/LocalLLaMA / 3/22/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A benchmark project was created to measure how local llama.cpp models' decode throughput and prefill throughput change as prompt context grows on an AMD Strix Halo 128GB system.
  • The goal is to identify context lengths where decode speed drops and to enable easier cross-model comparisons on the same hardware.
  • The tool fetches models from a local llama.cpp server, lets users run multiple models, benchmarks across increasing context buckets, and writes results to CSV with a dashboard to browse results.
  • The test setup uses AMD Ryzen AI Max+ 395, Radeon 8060S, 128GB RAM, and ROCm 7.2 in a distrobox environment.
  • The author invites feedback on benchmark methodology, additional metrics, tuning ideas for Strix Halo/ROCm, and dashboard improvements, and may share results separately.

Hi all,

I've seen a lot of test videos and posts about how good a Strix Halo machine (GTR9 Pro) really is for local LLMs at long context lengths.

So I put together a small benchmark project for testing how local llama.cpp models behave as context length increases on an AMD Strix Halo 128GB machine.

Benchmark results site:
https://bluepaun.github.io/amd-strix-halo-context-bench/index.html?lang=en

Repo:

https://github.com/bluepaun/amd-strix-halo-context-bench

The main goal was pretty simple:

• measure decode throughput and prefill throughput

• see how performance changes as prompt context grows

• find the point where decode speed drops below 10 tok/sec

• make it easier to compare multiple local models on the same machine
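The two core numbers above come straight out of the llama.cpp server's per-request timing data. A minimal sketch of that calculation, assuming the `timings` field names (`prompt_n`, `prompt_ms`, `predicted_n`, `predicted_ms`) reported by recent llama.cpp server builds:

```python
def throughputs(timings: dict) -> tuple[float, float]:
    """Compute (prefill, decode) tokens/sec from a llama.cpp /completion
    response's `timings` object. Field names are as reported by recent
    llama.cpp server builds -- verify against your build's response."""
    prefill = timings["prompt_n"] / (timings["prompt_ms"] / 1000.0)
    decode = timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)
    return prefill, decode

# Illustrative numbers: an 8000-token prompt prefilled in 20 s,
# then 200 tokens decoded in 25 s.
pf, dec = throughputs({"prompt_n": 8000, "prompt_ms": 20000.0,
                       "predicted_n": 200, "predicted_ms": 25000.0})
print(f"prefill {pf:.0f} tok/s, decode {dec:.0f} tok/s")
# prefill 400 tok/s, decode 8 tok/s
```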

What it does:

• fetches models from a local llama.cpp server

• lets you select one or more models in a terminal UI

• benchmarks them across increasing context buckets

• writes results incrementally to CSV

• includes a small GitHub Pages dashboard for browsing results

Test platform used for this repo:

• AMD Ryzen AI Max+ 395

• AMD Radeon 8060S

• 128GB system memory

• Strix Halo setup based on a ROCm 7.2 distrobox environment

I made this because I wanted something more practical than a single “max context” number.

On this kind of system, what really matters is:

• how usable throughput changes at 10K / 20K / 40K / 80K / 100K+

• how fast prefill drops

• where long-context inference stops feeling interactive
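The "stops feeling interactive" point above reduces to a simple scan over the sweep results: the first context bucket where decode throughput falls below the 10 tok/s cut-off. A sketch, with made-up example numbers:

```python
def interactivity_limit(results, threshold: float = 10.0):
    """Given (ctx, decode_tps) pairs in increasing context order, return
    the first context length where decode throughput falls below
    `threshold` tok/s, or None if it never does. 10 tok/s is the
    cut-off the benchmark uses."""
    for ctx, decode_tps in results:
        if decode_tps < threshold:
            return ctx
    return None

# Illustrative sweep, not real measurements:
sweep = [(10_000, 24.1), (20_000, 15.3), (40_000, 9.2), (80_000, 5.0)]
print(interactivity_limit(sweep))  # 40000
```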

If you’re also testing Strix Halo, Ryzen AI Max+ 395, or other large-memory local inference setups, I’d be very interested in comparisons or suggestions.

Feedback welcome — especially on:

• better benchmark methodology

• useful extra metrics to record

• Strix Halo / ROCm tuning ideas

• dashboard improvements

If there’s interest, I can also post some benchmark results separately.

submitted by /u/Far-Jellyfish7794