Streaming experts

Simon Willison's Blog / 3/24/2026



I wrote about Dan Woods' experiments with streaming experts the other day, the trick where you run larger Mixture-of-Experts models on hardware that doesn't have enough RAM to fit the entire model by instead streaming the necessary expert weights from SSD for each token that you process.
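To make the idea concrete, here's a toy sketch of that streaming pattern: expert weight matrices live back-to-back in a file on SSD, and for each token only the experts the router selected are read into RAM, with a tiny LRU cache to avoid re-reading hot experts. All names, sizes, and the averaging combine step are illustrative, not Dan's actual implementation.

```python
# Minimal sketch of "streaming experts" for a Mixture-of-Experts layer:
# expert weights stay on disk, and only the experts the router selects
# for the current token are read into RAM. Everything here is a toy.

import os
import struct
import tempfile
from collections import OrderedDict

NUM_EXPERTS, DIM = 8, 4          # toy sizes
EXPERT_BYTES = DIM * DIM * 8     # one float64 DIM x DIM matrix per expert

def write_expert_file(path):
    """Persist all expert weight matrices back-to-back in one file."""
    with open(path, "wb") as f:
        for e in range(NUM_EXPERTS):
            # deterministic toy weights: expert e scales inputs by (e + 1)
            mat = [[(e + 1.0) if i == j else 0.0 for j in range(DIM)]
                   for i in range(DIM)]
            for row in mat:
                f.write(struct.pack(f"{DIM}d", *row))

class StreamingExperts:
    """Reads only the requested experts from disk, with a small LRU cache."""

    def __init__(self, path, cache_size=2):
        self.path = path
        self.cache_size = cache_size
        self.cache = OrderedDict()  # expert id -> weight matrix held in RAM

    def load(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # cache hit: mark most recent
            return self.cache[expert_id]
        with open(self.path, "rb") as f:
            f.seek(expert_id * EXPERT_BYTES)   # stream just this one expert
            raw = f.read(EXPERT_BYTES)
        mat = [list(struct.unpack(f"{DIM}d", raw[r * DIM * 8:(r + 1) * DIM * 8]))
               for r in range(DIM)]
        self.cache[expert_id] = mat
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)     # evict least-recently used
        return mat

    def forward(self, x, top_experts):
        """Average the outputs of the router-selected experts for one token."""
        out = [0.0] * DIM
        for e in top_experts:
            w = self.load(e)
            for i in range(DIM):
                out[i] += sum(w[i][j] * x[j] for j in range(DIM))
        return [v / len(top_experts) for v in out]

path = os.path.join(tempfile.mkdtemp(), "experts.bin")
write_expert_file(path)
layer = StreamingExperts(path)
# pretend the router picked experts 2 and 4 for this token;
# only those two matrices are read from disk
y = layer.forward([1.0, 0.0, 0.0, 0.0], [2, 4])
```

The point of the sketch is the access pattern, not the math: RAM usage is bounded by the cache size rather than the total parameter count, which is why a 1T-parameter model with a small active-expert set can fit in tens of gigabytes. Real implementations would use memory-mapped quantized tensors and proper gating, but the seek-read-evict loop is the core trick.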

Five days ago Dan was running Qwen3.5-397B-A17B in 48GB of RAM. Today @seikixtc reported running the colossal Kimi K2.5 - a 1 trillion parameter model with 32B active weights at any one time - in 96GB of RAM on an M2 Max MacBook Pro.

And @anemll showed that same Qwen3.5-397B-A17B model running on an iPhone, albeit at just 0.6 tokens/second - iOS repo here.

I think this technique has legs. Dan and his fellow tinkerers are continuing to run autoresearch loops in order to find yet more optimizations to squeeze more performance out of these models.

Posted 24th March 2026 at 5:09 am

