Runs fully on-device at 4-bit quantization with 256 experts.
It streams the experts of MoE models from the SSD to the GPU, so the full set of expert weights doesn't have to sit in memory at once.
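To give a rough idea of what streaming experts from storage means, here is a minimal Python sketch. It is not the app's code: the real engine is Metal on iOS, and the file layout, the `ExpertStreamer` class, the cache size, and the tiny weight blocks are all assumptions made up for illustration. Only the 256-expert count comes from the post. The idea shown is that the router picks a few experts per token and only those blocks are read from disk, with a small LRU cache for recently used ones.

```python
import mmap
import os
import struct
import tempfile
from collections import OrderedDict

NUM_EXPERTS = 256       # from the post; everything else here is illustrative
FLOATS_PER_EXPERT = 8   # tiny on purpose; real experts are far larger
EXPERT_BYTES = FLOATS_PER_EXPERT * 4  # float32


def write_expert_file(path):
    """Write NUM_EXPERTS fixed-size blocks; expert i is filled with float(i)."""
    with open(path, "wb") as f:
        for i in range(NUM_EXPERTS):
            values = [float(i)] * FLOATS_PER_EXPERT
            f.write(struct.pack(f"<{FLOATS_PER_EXPERT}f", *values))


class ExpertStreamer:
    """Hypothetical helper: reads only requested experts, with an LRU cache."""

    def __init__(self, path, cache_size=4):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._cache = OrderedDict()
        self._cache_size = cache_size
        self.disk_reads = 0

    def get(self, expert_id):
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)  # mark as most recently used
            return self._cache[expert_id]
        start = expert_id * EXPERT_BYTES
        block = self._mm[start:start + EXPERT_BYTES]  # touches only this slice
        weights = struct.unpack(f"<{FLOATS_PER_EXPERT}f", block)
        self.disk_reads += 1
        self._cache[expert_id] = weights
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict least recently used
        return weights

    def close(self):
        self._mm.close()
        self._file.close()


def top_k(scores, k):
    """Indices of the k highest router scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def demo():
    path = os.path.join(tempfile.mkdtemp(), "experts.bin")
    write_expert_file(path)
    streamer = ExpertStreamer(path, cache_size=4)
    scores = [0.0] * NUM_EXPERTS
    scores[7], scores[200] = 0.9, 0.8   # pretend router output for one token
    chosen = top_k(scores, 2)
    loaded = {e: streamer.get(e) for e in chosen}
    streamer.get(7)                     # served from cache, no extra read
    reads = streamer.disk_reads
    streamer.close()
    return chosen, loaded, reads
```

Calling `demo()` routes one fake token to two experts and reads just those two blocks. On a phone the equivalent step would presumably hand the bytes to a GPU buffer rather than unpack them into Python tuples.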
I saw the article from Dan Woods and decided to port the Metal inference engine to iOS, add a few optimizations, and build a basic app.
I'm currently generating the weights for the 379B model and will have that running next.
