Hi all,
I've seen a lot of test videos and posts asking how well a Strix Halo machine (GTR9 PRO) handles local LLMs at long context lengths.
So I put together a small benchmark project for testing how local llama.cpp models behave as context length increases on an AMD Strix Halo 128GB machine.
Benchmark results site:
https://bluepaun.github.io/amd-strix-halo-context-bench/index.html?lang=en
Repo:
https://github.com/bluepaun/amd-strix-halo-context-bench
The main goal was pretty simple:
• measure decode throughput and prefill throughput
• see how performance changes as prompt context grows
• find the point where decode speed drops below 10 tok/sec
• make it easier to compare multiple local models on the same machine
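For the first two goals, a minimal sketch of how prefill and decode throughput can be derived: llama.cpp's native server reports per-request timing counters, and throughput is just tokens divided by elapsed time. The `timings` field shape below is an assumption based on llama.cpp's `/completion` response and may differ across server versions.

```python
# Convert llama.cpp-style timing counters into tokens/sec.
# Assumed response fields: prompt_n / prompt_ms (prefill),
# predicted_n / predicted_ms (decode).

def throughput_from_timings(timings: dict) -> dict:
    prefill_tps = timings["prompt_n"] / (timings["prompt_ms"] / 1000.0)
    decode_tps = timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)
    return {"prefill_tps": prefill_tps, "decode_tps": decode_tps}

# Illustrative numbers (not measurements): 8000 prompt tokens prefilled
# in 20 s, 256 tokens decoded in 16 s.
example = {"prompt_n": 8000, "prompt_ms": 20000.0,
           "predicted_n": 256, "predicted_ms": 16000.0}
print(throughput_from_timings(example))  # prefill 400 tok/s, decode 16 tok/s
```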
What it does:
• fetches models from a local llama.cpp server
• lets you select one or more models in a terminal UI
• benchmarks them across increasing context buckets
• writes results incrementally to CSV
• includes a small GitHub Pages dashboard for browsing results
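The loop above can be sketched roughly like this (this is not the repo's actual code): walk a list of context buckets, time one request per bucket against the local llama.cpp server, and append each result to the CSV as soon as it finishes so partial runs survive a crash. The endpoint path, server address, and the rough characters-per-token padding heuristic are all assumptions.

```python
# Hedged sketch of a bucketed benchmark loop with incremental CSV output.
import csv
import json
import time
import urllib.request

SERVER = "http://127.0.0.1:8080"                    # assumed server address
BUCKETS = [2_000, 10_000, 20_000, 40_000, 80_000]   # target prompt sizes (tokens)

def make_prompt(target_tokens: int) -> str:
    # Crude padding: the repeated sentence is ~9 tokens, so repeat enough
    # copies to approximate the target prompt length.
    return "The quick brown fox jumps over the lazy dog. " * (target_tokens // 9)

def bench_bucket(model: str, n_ctx_tokens: int) -> dict:
    body = json.dumps({
        "model": model,
        "prompt": make_prompt(n_ctx_tokens),
        "max_tokens": 128,
    }).encode()
    req = urllib.request.Request(f"{SERVER}/v1/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        json.load(resp)
    return {"model": model, "bucket": n_ctx_tokens,
            "wall_s": time.perf_counter() - t0}

def run(models: list[str], out_path: str = "results.csv") -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["model", "bucket", "wall_s"])
        writer.writeheader()
        for model in models:
            for bucket in BUCKETS:
                writer.writerow(bench_bucket(model, bucket))
                f.flush()  # write incrementally so a crash keeps prior rows
```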
Test platform used for this repo:
• AMD Ryzen AI Max+ 395
• AMD Radeon 8060S
• 128GB system memory
• Strix Halo setup based on a ROCm 7.2 distrobox environment
I made this because I wanted something more practical than a single “max context” number.
On this kind of system, what really matters is:
• how usable throughput changes at 10K / 20K / 40K / 80K / 100K+
• how fast prefill drops
• where long-context inference stops feeling interactive
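The "stops feeling interactive" point can be read straight off the per-bucket results: scan them in order of context size and report the first bucket where decode throughput falls below the threshold. The 10 tok/s cutoff is from the post; the sample numbers below are purely illustrative, not measurements.

```python
# Find the first context bucket where decode speed drops below a threshold.

def first_below(results: list[tuple[int, float]], threshold: float = 10.0):
    """results: (context_bucket_tokens, decode_tok_per_sec), sorted by bucket."""
    for bucket, decode_tps in results:
        if decode_tps < threshold:
            return bucket
    return None  # never dropped below the threshold in the tested range

sample = [(10_000, 28.5), (20_000, 19.2), (40_000, 11.7), (80_000, 7.4)]
print(first_below(sample))  # -> 80000
```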
If you’re also testing Strix Halo, Ryzen AI Max+ 395, or other large-memory local inference setups, I’d be very interested in comparisons or suggestions.
Feedback welcome — especially on:
• better benchmark methodology
• useful extra metrics to record
• Strix Halo / ROCm tuning ideas
• dashboard improvements
If there’s interest, I can also post some benchmark results separately.
