Hello everyone. Over the last couple of months I have been assembling a local AI setup for personal use, and I thought I would write a post here, firstly to collect some thoughts on the whole concept, and secondly to perhaps gather some feedback. My setup is nowhere near as advanced as many of the professional rigs posted here, but I have the following specs: So far I have mainly been using it to run Qwen 3.6 27B at Q8 on the two cards together. After some experimentation, I landed on running my models with llama.cpp using the Vulkan backend. To get it out of the way: I am aware of the connectivity limitations in this system, especially for the third GPU, which would run at a measly 4x Gen 4 lanes. This is likely to be a significant bottleneck if I were to run a single model distributed over all of my GPUs. I would love to eventually upgrade to something like a Threadripper platform, or use a PCIe fabric card to connect the GPUs more directly (something like the LR-Link card recently shown on the Level1Techs channel), but due to the high cost that will have to wait.

I am working on a hobby research project in the programming-languages area, so access to some less common knowledge is generally very helpful. AFAIK there isn't really anything stronger than 27B for me to run locally at the moment. Eventually, with 96GB of VRAM, I could run something bigger, but the PCIe limitations would hurt overall performance in that scenario. I was therefore considering running two or three agents locally, with a smarter overseer like K2.6 via API. For certain tasks that are smaller in scope, or where the lower speed would be acceptable, I could also consider some CPU inference, since I have plenty of system RAM to utilize as well. Generally, the idea I was considering was constructing some form of harness that allows me to do semi-autonomous research and development within the scope of my project.
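A minimal sketch of the routing idea described above, in Python. The task fields, tier names, and thresholds are my own illustration (not from the post): the point is just that a harness can pick between CPU inference, a local GPU agent, and the cloud overseer per task.

```python
from dataclasses import dataclass

# Hypothetical task descriptor; the field names are illustrative only.
@dataclass
class Task:
    prompt: str
    scope: str        # "small" | "normal" | "hard"
    latency_ok: bool  # can this task tolerate slow CPU inference?

def route_task(task: Task) -> str:
    """Pick an inference tier for a task.

    Tiers (assumed, following the post's rough plan):
      - "cpu":   small tasks that tolerate low speed -> CPU inference in system RAM
      - "cloud": tasks needing deeper knowledge -> API overseer (e.g. K2.6)
      - "gpu":   everything else -> a local 27B agent on one of the GPUs
    """
    if task.scope == "small" and task.latency_ok:
        return "cpu"
    if task.scope == "hard":
        return "cloud"
    return "gpu"

print(route_task(Task("refactor parser", "normal", False)))  # gpu
print(route_task(Task("rename a variable", "small", True)))  # cpu
print(route_task(Task("prove type soundness", "hard", False)))  # cloud
```

In a real harness the `route_task` decision would be made by the overseer model itself rather than hard-coded rules, but a deterministic fallback like this keeps the system usable when the API is unreachable.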
Potential deployments could consist of a number of agentic developers/testers/thinkers running separately, for example with Q6 quants of 27B, so that each has its own GPU. Depending on the workload, it could be nice for the "overseer" to dynamically deploy the agents and models that fit the current workload (for certain tasks we might want to pause development and run a big model across all GPUs together, to benefit from its larger knowledge). Because of the complex and specific nature of the project, it touches on niche CS areas that models like 27B are aware of but may not be well optimized for, so I think one key aspect would be letting the agents access internet search and bigger cloud models when necessary.

Overall, the part that is most interesting to me, and that I know the least about at the moment, is how to effectively engineer a harness to manage this hardware deployment and the project. I could definitely spend some time just (vibe) coding something to fit my specific needs, but I do not think my setup is, at least conceptually, anything new. I am aware that solutions like LangGraph and CrewAI exist, although I am unsure which would fit my use case best and be easily extensible for my needs. I would be very curious to learn about other people's experiences and thoughts on this hardware setup and potential deployments on it. If you read through all of that, thank you very much, and sorry for the chaotic writing style. Cheers.
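The "dynamic deployment" idea above can be sketched as a small state machine over GPU slots. This is a hypothetical sketch (class and method names are mine); a real harness would start and stop llama.cpp server processes at these points instead of just tracking assignments.

```python
# Overseer-side bookkeeping for GPU "slots": either one agent per GPU,
# or all GPUs reclaimed for a single big model.

class GpuPool:
    def __init__(self, num_gpus: int):
        self.num_gpus = num_gpus
        self.assignments: dict[int, str] = {}  # GPU index -> role or model name

    def deploy_agents(self, roles: list[str]) -> None:
        """Give each agent role its own GPU (e.g. developer/tester/thinker on Q6 27B)."""
        if len(roles) > self.num_gpus:
            raise ValueError("more roles than GPUs")
        self.assignments = {i: role for i, role in enumerate(roles)}

    def deploy_big_model(self, name: str) -> None:
        """Pause the agents and dedicate every GPU to one large model."""
        self.assignments = {i: name for i in range(self.num_gpus)}

pool = GpuPool(num_gpus=3)
pool.deploy_agents(["developer", "tester", "thinker"])
print(pool.assignments)  # {0: 'developer', 1: 'tester', 2: 'thinker'}
pool.deploy_big_model("big-moe")
print(pool.assignments)  # every GPU now serves the same model
```

Frameworks like LangGraph model the agent-to-agent control flow but not the GPU placement, so even with an off-the-shelf orchestrator, a thin layer like this for mapping agents to devices would likely still be custom.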
3xR9700 for semi-autonomous research and development - looking for setup/config ideas.
Reddit r/LocalLLaMA / 5/3/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage
Key Points
- A Reddit user describes a local AI workstation built around dual Radeon R9700 GPUs (and a limited third-GPU setup) to support semi-autonomous research and development for a programming-language hobby project.
- They report running Qwen 3.6 27B at Q8 using llama.cpp with Vulkan drivers, while noting PCIe connectivity limitations—especially for a third GPU—that can bottleneck distributed single-model workloads.
- They consider future upgrades (Threadripper platform or PCIe fabric cards like LR-Link) to improve GPU interconnect bandwidth, but expect to wait due to cost.
- To work around performance constraints, they propose an “agentic” approach: running multiple smaller quantized models (e.g., Q6 variants of ~27B) across GPUs, orchestrated by an API-based overseer (mentioned K2.6), and optionally using CPU inference for smaller tasks.
- The post is seeking ideas and configuration/setup feedback to build a harness that dynamically deploys agents and models based on workload needs.