This guy, Dan Woods, used Karpathy's autoresearch and Apple's "LLM in a Flash" paper to evolve a harness that can run Qwen3.5 397B at 5.7 tokens/s (t/s) with only 48GB of RAM.
X.com article here, GitHub repository and paper here.
He says the math suggests 18 t/s is possible on his hardware, and that dense models, which have a more predictable weight access pattern, could get even faster.
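For anyone wondering what that kind of math might look like, here is a rough back-of-the-envelope sketch. It is not Dan's actual calculation, just the usual bandwidth-bound estimate for streaming weights from flash: tokens/s is capped by storage read bandwidth divided by the bytes that have to be read per token. The function names, parameters, and every number in the example are placeholders I made up for illustration, not figures from the post or the repo.

```python
def bytes_streamed_per_token(active_params: float,
                             bytes_per_param: float,
                             cache_hit_rate: float) -> float:
    """Bytes that must come from flash for one token (hypothetical model).

    active_params   -- parameters touched per token (assumed input)
    bytes_per_param -- storage size per parameter, e.g. 0.5 for 4-bit (assumed)
    cache_hit_rate  -- fraction of those weights already resident in RAM (assumed)
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return active_params * bytes_per_param * (1.0 - cache_hit_rate)


def max_tokens_per_second(streamed_bytes: float,
                          storage_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if flash reads are the only bottleneck."""
    if streamed_bytes <= 0:
        raise ValueError("streamed_bytes must be positive")
    return storage_bandwidth_bytes_per_s / streamed_bytes


if __name__ == "__main__":
    # Placeholder values only, NOT measurements of Qwen3.5 397B or any real machine.
    streamed = bytes_streamed_per_token(
        active_params=10e9, bytes_per_param=0.5, cache_hit_rate=0.5
    )
    print(max_tokens_per_second(streamed, storage_bandwidth_bytes_per_s=5e9))
```

The point of the sketch is just that the ceiling moves with two knobs: how much of the per-token working set is already in RAM, and how fast the rest can be read. A more predictable access pattern helps with both, since reads can be prefetched and batched.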




