| Hey all. I'm pretty new to low-level GPU stuff, but for fun I wanted to see if I could make expert parallelism work on my Strix Halo nodes (Minisforum boxes, 128GB unified memory each) that I'm running as part of my k8s cluster. I must admit I have been using AI heavily and asked many stupid questions along the way, but I'm quite happy with the progress and wanted to share it. Here is my dashboard showing my workload running across my two machines: From here I plan to surgically go after the bottlenecks. I'm thinking about writing ROCm kernels directly for some parts where ggml feels a bit limiting. Would love some guidance from someone more experienced in this field, since my background is mostly webdev and TypeScript. Thanks :) |
Tried to vibe code expert parallelism on Strix Halo — running Qwen3.5 122B-A10B at 9.5 tok/s
Reddit r/LocalLLaMA / 3/23/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage
Key Points
- The author is attempting to enable Expert Parallelism on Strix Halo nodes within a Kubernetes cluster (two MinisForum boxes with 128GB unified memory each).
- They report running Qwen3.5 122B-A10B at about 9.5 tokens per second on this setup.
- Their plan is to identify bottlenecks and potentially write ROCm kernels to overcome perceived ggml limitations.
- The post invites guidance from more experienced practitioners and notes the author’s background in web development and TypeScript.
