AI Navigate

I just ran Qwen3.5 35B on my iPhone at 5.6 tok/sec.

Reddit r/LocalLLaMA / 3/22/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A Reddit user reports running Qwen3.5 35B completely on-device on iPhone using 4-bit quantization with 256 experts in an MoE setup.
  • The approach streams expert weights from SSD to the GPU on demand, enabling efficient on-device inference for large MoE models.
  • The author ported Apple's Metal inference engine to iOS, added optimizations, and built a basic app to showcase the capability.
  • They are in the process of preparing weights for a 379B model and expect to run it next, signaling ongoing scaling progress.
  • This demonstrates growing feasibility of large language models running locally on mobile devices, potentially reducing cloud dependence.

Fully on-device at 4-bit with 256 experts.

It streams the experts of MoE models from SSD to the GPU.
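The post doesn't include code, but the general idea of keeping expert weights on SSD and loading only the ones the router selects can be sketched as an LRU cache over per-expert weight files. This is a minimal illustration, assuming one file per expert; the class name, file layout, and cache size are all hypothetical, not the author's implementation.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache that streams MoE expert weights from SSD on demand.

    Only `capacity` experts stay resident; the rest live on disk and
    are read back in whenever the router selects them again.
    """

    def __init__(self, weight_files, capacity=8):
        self.weight_files = weight_files  # expert_id -> path on SSD (assumed layout)
        self.capacity = capacity          # experts kept resident at once
        self.cache = OrderedDict()        # expert_id -> raw weight bytes

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as most recently used
            return self.cache[expert_id]
        # Cache miss: stream the expert's weights from SSD.
        with open(self.weight_files[expert_id], "rb") as f:
            weights = f.read()
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used expert
        return weights
```

In a real engine the loaded bytes would be uploaded to a GPU buffer (e.g. via Metal on iOS) rather than kept in host memory, and prefetching would overlap the SSD reads with compute, but the residency logic is the same.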

I saw the article from Dan Woods and decided to port the Metal inference engine to iOS, add a few optimizations, and build a basic app.

I'm currently generating the weights for the 379B model and will have that running next.

submitted by /u/Alexintosh