AI Navigate

I just ran Qwen3.5 35B on my iPhone at 5.6 tok/sec.

Reddit r/LocalLLaMA / 3/22/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A Reddit user reports running Qwen3.5 35B completely on-device on iPhone using 4-bit quantization with 256 experts in an MoE setup.
  • The approach streams expert weights from SSD to the GPU on demand, enabling efficient on-device inference for large MoE models.
  • The author ported Apple's Metal inference engine to iOS, added optimizations, and built a basic app to showcase the capability.
  • They are in the process of preparing weights for a 379B model and expect to run it next, signaling ongoing scaling progress.
  • This demonstrates growing feasibility of large language models running locally on mobile devices, potentially reducing cloud dependence.

Fully on-device at 4-bit with 256 experts.

It streams the experts of MoE models from SSD to the GPU.
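The post doesn't include code, but the general idea of keeping expert weights on SSD and loading only the ones the router selects can be sketched as an LRU cache over per-expert weight files. This is a minimal illustration, assuming one file per expert; the class name, file layout, and cache size are all hypothetical, not the author's implementation.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache that streams MoE expert weights from SSD on demand.

    Only `capacity` experts stay resident; the rest live on disk and
    are read back in whenever the router selects them again.
    """

    def __init__(self, weight_files, capacity=8):
        self.weight_files = weight_files  # expert_id -> path on SSD (assumed layout)
        self.capacity = capacity          # experts kept resident at once
        self.cache = OrderedDict()        # expert_id -> raw weight bytes

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as most recently used
            return self.cache[expert_id]
        # Cache miss: stream the expert's weights from SSD.
        with open(self.weight_files[expert_id], "rb") as f:
            weights = f.read()
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used expert
        return weights
```

In a real engine the loaded bytes would be uploaded to a GPU buffer (e.g. via Metal on iOS) rather than kept in host memory, and prefetching would overlap the SSD reads with compute, but the residency logic is the same.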

I saw the article from Dan Woods and decided to port the Metal inference engine to iOS, add a few optimizations, and build a basic app.

I'm currently generating the weights for the 379B model and will have that running next.

submitted by /u/Alexintosh