Hi all,
I've seen a lot of test videos and posts asking how well a Strix Halo machine (GTR9 PRO) handles local LLMs at long context lengths.
So I put together a small benchmark project for testing how local llama.cpp models behave as context length increases on an AMD Strix Halo 128GB machine.
Benchmark results site:
https://bluepaun.github.io/amd-strix-halo-context-bench/index.html?lang=en
Repo:
https://github.com/bluepaun/amd-strix-halo-context-bench
The main goal was pretty simple:
• measure decode throughput and prefill throughput
• see how performance changes as prompt context grows
• find the point where decode speed drops below 10 tok/sec
• make it easier to compare multiple local models on the same machine
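For the first two goals, a minimal sketch of how prefill and decode throughput can be derived: llama.cpp's native server reports per-request timing counters, and throughput is just tokens divided by elapsed time. The `timings` field shape below is an assumption based on llama.cpp's `/completion` response and may differ across server versions.

```python
# Convert llama.cpp-style timing counters into tokens/sec.
# Assumed response fields: prompt_n / prompt_ms (prefill),
# predicted_n / predicted_ms (decode).

def throughput_from_timings(timings: dict) -> dict:
    prefill_tps = timings["prompt_n"] / (timings["prompt_ms"] / 1000.0)
    decode_tps = timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)
    return {"prefill_tps": prefill_tps, "decode_tps": decode_tps}

# Illustrative numbers (not measurements): 8000 prompt tokens prefilled
# in 20 s, 256 tokens decoded in 16 s.
example = {"prompt_n": 8000, "prompt_ms": 20000.0,
           "predicted_n": 256, "predicted_ms": 16000.0}
print(throughput_from_timings(example))  # prefill 400 tok/s, decode 16 tok/s
```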
What it does:
• fetches models from a local llama.cpp server
• lets you select one or more models in a terminal UI
• benchmarks them across increasing context buckets
• writes results incrementally to CSV
• includes a small GitHub Pages dashboard for browsing results
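The loop above can be sketched roughly like this (this is not the repo's actual code): walk a list of context buckets, time one request per bucket against the local llama.cpp server, and append each result to the CSV as soon as it finishes so partial runs survive a crash. The endpoint path, server address, and the rough characters-per-token padding heuristic are all assumptions.

```python
# Hedged sketch of a bucketed benchmark loop with incremental CSV output.
import csv
import json
import time
import urllib.request

SERVER = "http://127.0.0.1:8080"                    # assumed server address
BUCKETS = [2_000, 10_000, 20_000, 40_000, 80_000]   # target prompt sizes (tokens)

def make_prompt(target_tokens: int) -> str:
    # Crude padding: the repeated sentence is ~9 tokens, so repeat enough
    # copies to approximate the target prompt length.
    return "The quick brown fox jumps over the lazy dog. " * (target_tokens // 9)

def bench_bucket(model: str, n_ctx_tokens: int) -> dict:
    body = json.dumps({
        "model": model,
        "prompt": make_prompt(n_ctx_tokens),
        "max_tokens": 128,
    }).encode()
    req = urllib.request.Request(f"{SERVER}/v1/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        json.load(resp)
    return {"model": model, "bucket": n_ctx_tokens,
            "wall_s": time.perf_counter() - t0}

def run(models: list[str], out_path: str = "results.csv") -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["model", "bucket", "wall_s"])
        writer.writeheader()
        for model in models:
            for bucket in BUCKETS:
                writer.writerow(bench_bucket(model, bucket))
                f.flush()  # write incrementally so a crash keeps prior rows
```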
Test platform used for this repo:
• AMD Ryzen AI Max+ 395
• AMD Radeon 8060S
• 128GB system memory
• Strix Halo setup based on a ROCm 7.2 distrobox environment
I made this because I wanted something more practical than a single “max context” number.
On this kind of system, what really matters is:
• how usable throughput changes at 10K / 20K / 40K / 80K / 100K+
• how fast prefill drops
• where long-context inference stops feeling interactive
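The "stops feeling interactive" point can be read straight off the per-bucket results: scan them in order of context size and report the first bucket where decode throughput falls below the threshold. The 10 tok/s cutoff is from the post; the sample numbers below are purely illustrative, not measurements.

```python
# Find the first context bucket where decode speed drops below a threshold.

def first_below(results: list[tuple[int, float]], threshold: float = 10.0):
    """results: (context_bucket_tokens, decode_tok_per_sec), sorted by bucket."""
    for bucket, decode_tps in results:
        if decode_tps < threshold:
            return bucket
    return None  # never dropped below the threshold in the tested range

sample = [(10_000, 28.5), (20_000, 19.2), (40_000, 11.7), (80_000, 7.4)]
print(first_below(sample))  # -> 80000
```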
If you’re also testing Strix Halo, Ryzen AI Max+ 395, or other large-memory local inference setups, I’d be very interested in comparisons or suggestions.
Feedback welcome — especially on:
• better benchmark methodology
• useful extra metrics to record
• Strix Halo / ROCm tuning ideas
• dashboard improvements
If there’s interest, I can also post some benchmark results separately.
