Autoresearch on Qwen3.5-397B, 36 experiments to reach 20.34 tok/s on M5 Max, honest results

Reddit r/LocalLLaMA / 3/30/2026


Key Points

  • A week-long “autoresearch” using Dan Woods’ flash-moe methodology optimized Qwen3.5-397B on a MacBook Pro M5 Max (128GB), improving decode speed from 10.61 tok/s to 20.34 tok/s—about a 2x software-only gain.
  • The test environment streamed a 209GB model from SSD (model larger than RAM) and used Qwen3.5-397B-A17B with Q3-GGUF experts and specific decoding/prefill settings, achieving ~20+ tok/s once page cache warmed.
  • Across 36 systematically logged experiments with perplexity-based quality gating, the work showed a shifting bottleneck: initial SSD I/O limits, then GPU encoding overhead, then projection kernels.
  • Key performance wins came from enabling 16 IO threads with parallel, page-aligned SSD reads; using temporal expert prediction to exploit cross-token routing correlation and overlap I/O with compute; and selecting Q3-GGUF expert quantization as a speed/quality sweet spot.
  • The article reports that Q3-GGUF expert support from the Anemll flash-moe fork, combined with additional Metal-level inference optimizations, was essential to reaching the reported throughput.

I spent the past week trying to push Qwen3.5-397B faster on my M5 Max 128GB. Dan Woods' (@danveloper) original baseline was 4.36 tok/s on M3 Max. On M5 Max the starting point was already 10.61 tok/s due to better hardware. My optimizations pushed it to 20.34 tok/s, roughly 2x through software alone, and 4.67x over Dan's original result.

Hardware: MacBook Pro M5 Max, 128GB unified memory, 40-Core GPU

Model config: Qwen3.5-397B-A17B, Q3-GGUF experts (Unsloth IQ3_XXS/IQ4_XS mixed precision), Q8_0 embedding, Q6_K LM head. Decode: 20.34 tok/s. Prefill: 5.52 tok/s. The model is 209GB on disk, about 1.6x the 128GB of RAM, so everything streams from SSD.

Screenshot of an actual run below. You can see individual tokens hitting 20+ tok/s once the page cache warms up!

Methodology: I used the autoresearch loop methodology originally developed by Dan Woods (github.com/danveloper/flash-moe), running it with Claude Code (Anthropic) to systematically run and evaluate experiments on the M5 Max. Each experiment was logged with its result before moving to the next, with automatic quality gating via a perplexity threshold to catch regressions. Human-AI collaboration: I directed the research, provided the hardware, and made all scientific decisions; Claude Code implemented and benchmarked under my direction. This let me cover 36 experiments in a few days instead of weeks. Full paper PDF is available in the repo.
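To make the gating idea concrete, here is a minimal sketch of a perplexity-gated experiment loop. All names, the tolerance value, and the loop structure are my own illustrations, not code from the flash-moe repo; only the 5.58 baseline perplexity comes from the post.

```python
# Hypothetical sketch of an autoresearch quality gate: each experiment's
# perplexity must stay near the baseline or the change is discarded.

BASELINE_PPL = 5.58    # Q3-GGUF baseline perplexity reported in the post
MAX_REGRESSION = 0.05  # assumed tolerance; the actual threshold is not stated

def quality_gate(experiment_ppl: float,
                 baseline_ppl: float = BASELINE_PPL,
                 tolerance: float = MAX_REGRESSION) -> bool:
    """Accept an experiment only if perplexity did not regress past tolerance."""
    return experiment_ppl <= baseline_ppl + tolerance

def run_loop(experiments):
    """Log every experiment's result, keeping only those that pass the gate."""
    kept = []
    for name, ppl, tok_s in experiments:
        passed = quality_gate(ppl)
        print(f"{name}: {ppl:.2f} ppl, {tok_s:.2f} tok/s -> "
              f"{'keep' if passed else 'discard'}")
        if passed:
            kept.append(name)
    return kept
```

With a gate like this, a catastrophic result (e.g. the 1-bit QJL run at perplexity 5647 mentioned below) is rejected automatically instead of wasting a human review cycle.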

Built on: Dan Woods' original flash-moe paper github.com/danveloper/flash-moe and Anemll's fork github.com/Anemll/flash-moe. A pure C/Metal inference engine for running Qwen3.5-397B via SSD streaming on Apple Silicon. The Anemll fork added Q3-GGUF expert support which was essential to these results. My work adds further Metal-level optimizations on top.

One thing that became clear during autoresearch: every time you break through one wall, another one appears. SSD I/O was the bottleneck, then GPU encoding overhead, then projection kernels. Classic shifting bottleneck problem.

What actually moved the needle:

  1. 16 IO threads + cache-io-split=4: instead of reading each expert weight file as one sequential chunk, split it into 4 parallel page-aligned reads hitting different SSD channels simultaneously. Already built into the engine; it just needed to be enabled. +1.5 tok/s
  2. Temporal expert prediction: discovered a 27% cross-token routing correlation and used it to overlap SSD reads with GPU compute. +4.3 tok/s
  3. Q3-GGUF experts (Unsloth IQ3_XXS/IQ4_XS): smaller payload, and surprisingly Q3 turned out to be the sweet spot, with better perplexity than 4-bit (5.58 vs 5.62) while being 23% smaller. Unsloth is smart about which layers to compress more aggressively and which to leave at higher precision, so you don't lose as much quality as you'd expect from 3-bit. +2.3 tok/s
  4. CMD2 pre-encode: eliminates a 30μs per-layer command-submission gap. +0.44 tok/s
  5. Fused Q/K/V projection kernel: reads the input vector once instead of three times (Metal GPU optimization). +0.76 tok/s
  6. CMD2 pre-encode extended to all full-attention layers. +0.47 tok/s

Note: gains are not perfectly additive since some optimizations interact with each other.
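Item 2 (temporal expert prediction) is the single biggest win, and the core idea is simple enough to sketch. Assumption: the predictor here just reuses the previous token's experts, which is one plausible way to exploit the 27% cross-token correlation; the actual predictor in the engine may differ, and all names are illustrative.

```python
# Sketch of temporal expert prediction: because expert routing is
# correlated across consecutive tokens, speculatively prefetch the
# previous token's experts from SSD while the GPU is still computing.
# Prefetch hits cost nothing extra; only misses stall on SSD reads.

def predict_experts(prev_token_experts, top_k=8):
    """Guess the next token's experts: simply reuse the last token's set."""
    return set(prev_token_experts[:top_k])

def decode_step(router_choice, prefetched):
    """Split the router's actual choice into overlapped hits vs. blocking misses."""
    needed = set(router_choice)
    hits = needed & prefetched    # already streaming in: I/O hidden by compute
    misses = needed - prefetched  # must issue blocking SSD reads now
    return hits, misses
```

Even a 27% hit rate helps, because every hit turns a serial SSD read into one that overlaps with GPU compute.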

What failed (78% discard rate):

  • 1-bit QJL quantization: perplexity 5647, catastrophic
  • Ternary 2-bit: 84% weight sparsity, model collapsed
  • K=3 expert routing: quality collapse
  • Cross-layer prediction: 0% hit rate
  • NAX offloading: tile padding overhead cancelled the gains
  • 2-bit MLX experts: faster in isolation, but worse perplexity (5.71 vs 5.58) and no speed advantage once temporal prediction was applied to Q3

Honest limitations:

  • Single hardware platform, results may not generalize
  • Q3 quantization at this scale degrades noticeably on long-form generation: output quality was acceptable for short tasks but produced artifacts on longer responses. Quality was evaluated via perplexity only, not standardized benchmarks like MMLU or GPQA
  • This is a speed research project, not a production quality claim

Future work: one surprising finding was that Apple's Neural Engine (ANE) sat completely idle the entire time, drawing 0W. That's 38 TOPS of compute going unused. The problem is that MoE inference needs to decide which experts to activate dynamically, while the ANE only runs static pre-compiled graphs. There may be an opportunity for batch prefill, though. Full analysis is in the paper.

Paper + release: https://github.com/iluvclubs/flash-moe/releases/tag/v1.0

X/Twitter: DrPhoto

Thanks for reading. Happy to answer questions.

If anyone has ideas for further optimizations I am all ears. The ANE opportunity in particular feels underexplored.

submitted by /u/Equivalent-Buy1706