Hi all,
Like many of you, I'm passionate about running local models efficiently. I've spent the recently designing a custom hardware architecture – an NPU Array (v1) – specifically optimized for matrix multiplication and high TOPS/Watt performance for local AI inference.
I've just open-sourced the entire repository here: https://github.com/n57d30top/graph-assist-npu-array-v1-direct-add-commit-add-hi-tap/tree/main
Disclaimer: This is early-stage, experimental hardware design. It’s not a finished chip you can plug into a PCIe slot tomorrow. I am currently working on resolving routing congestion to hit my target clock frequencies.
However, I believe the open-source community needs more open silicon designs to eventually break the hardware monopoly and make running 70B+ parameters locally cheap and power-efficient.
I’d love for the community to take a look, point out flaws, or jump in if you're interested in the intersection of hardware array design and LLM inference. All feedback is welcome!
[link] [comments]