Run the Qwen3.5 flagship model (397 billion parameters) at 5–9 tok/s on a $2,100 desktop! Two $500 GPUs, 32 GB RAM, one NVMe drive. Uses Q4_K_M quants.

Reddit r/LocalLLaMA / 3/24/2026


Key Points

  • The post introduces FOMOE (Fast Opportunistic Mixture Of Experts), aimed at making large MoE inference practical on consumer hardware by reducing costly random NVMe weight reads.
  • It proposes keeping the most common expert weights in GPU VRAM via a rolling cache, using GPU memory effectively while cutting NVMe access during inference (reported to drop to 28% with a warm start).
  • A dual-GPU “ping-pong” architecture is used to overlap weight loading with compute, achieving over 5 tokens/second in the described setup.
  • An experimental Cache-Aware Routing (CAR) feature further reduces NVMe reads to about 7% by routing to next-best experts already resident in VRAM/DRAM within a quality threshold.
  • The article claims ~9 tokens/second with only a ~3.5% perplexity degradation on wikitext, and notes the implementation is a sizable C/HIP system (~15K lines) written with Claude under heavy human guidance.

Introducing FOMOE: Fast Opportunistic Mixture Of Experts (pronounced fomo).

The problem: Large Mixture of Experts (MoE) models need a lot of memory for their weights (hundreds of GB), which are typically stored in flash memory (e.g., NVMe). During inference only a small fraction of these weights is needed, but you don't know which ones ahead of time. This makes inference impractical on consumer hardware, since flash latencies are too high for random access patterns.

The solution: make most expert weight reads unnecessary.

First, store the most common experts in GPU memory (VRAM) and maintain an up-to-date rolling expert cache.
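The rolling expert cache can be sketched as a small LRU table over VRAM slots. Everything below is illustrative and not from the FOMOE code: the `ExpertCache` type, the slot and expert counts, and the counter-based recency tracking are all my assumptions.

```c
/* Hypothetical sketch of a rolling expert cache: a fixed number of
   VRAM slots, evicted least-recently-used. Slot/expert counts are
   invented for illustration. */
#define CACHE_SLOTS 4
#define N_EXPERTS   16

typedef struct {
    int expert_id[CACHE_SLOTS];           /* which expert occupies each VRAM slot (-1 = empty) */
    unsigned long last_used[CACHE_SLOTS]; /* recency clock for LRU eviction */
    unsigned long tick;
} ExpertCache;

void cache_init(ExpertCache *c) {
    for (int i = 0; i < CACHE_SLOTS; i++) {
        c->expert_id[i] = -1;
        c->last_used[i] = 0;
    }
    c->tick = 0;
}

/* Returns 1 on a VRAM hit, 0 on a miss. On a miss, evicts the
   least-recently-used slot and installs the requested expert
   (standing in for a DRAM/NVMe -> VRAM weight upload). */
int cache_access(ExpertCache *c, int expert) {
    c->tick++;
    int lru = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->expert_id[i] == expert) {
            c->last_used[i] = c->tick;
            return 1;                      /* hit: weights already resident */
        }
        if (c->last_used[i] < c->last_used[lru])
            lru = i;
    }
    c->expert_id[lru] = expert;            /* miss: evict LRU, install expert */
    c->last_used[lru] = c->tick;
    return 0;
}
```

Because expert activations are heavily skewed, even a small cache like this sees a high hit rate on a warm start; the real system tracks weights, not just IDs.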

With a 60% VRAM hit rate after a warm start, NVMe reads drop to 28% of expert fetches (another 12% are served from DRAM). Add a dual-GPU ping-pong architecture to overlap weight loading with compute, and you're already over 5 tok/s!
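A back-of-envelope model shows why both pieces matter: the hit fractions come from the post, but the latencies and function names below are invented placeholders.

```c
/* Illustrative only: the 60/12/28 hit split is from the post,
   all latencies are made-up placeholders. */

/* Expected per-expert fetch cost across the VRAM/DRAM/NVMe hierarchy. */
double expected_load_ms(double f_vram, double f_dram, double f_nvme,
                        double t_vram, double t_dram, double t_nvme) {
    return f_vram * t_vram + f_dram * t_dram + f_nvme * t_nvme;
}

/* Without overlap, each step pays compute plus weight loading. */
double step_serial_ms(double compute_ms, double load_ms) {
    return compute_ms + load_ms;
}

/* Ping-pong: while one GPU computes layer i, the other loads weights
   for layer i+1, so the step cost is the slower of the two. */
double step_pingpong_ms(double compute_ms, double load_ms) {
    return compute_ms > load_ms ? compute_ms : load_ms;
}
```

The NVMe term dominates `expected_load_ms`, so shrinking the NVMe fraction is the big lever; the ping-pong overlap then hides whatever load time remains under compute.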

Can we do better without collapsing model accuracy? The insight: if two experts score similarly, the model barely notices which one runs.

An experimental feature called Cache-Aware Routing (CAR) reduces NVMe reads to about 7% by picking the next-best-scoring expert already resident in the VRAM or DRAM cache, within an acceptable quality threshold.
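CAR's fallback rule can be sketched as a routing function. The name `car_route` and the exact threshold semantics are my assumptions about the behavior described, not the actual implementation:

```c
/* Hypothetical sketch of Cache-Aware Routing: fall back to the best
   *cached* expert when its router score is within `threshold` of the
   global best, otherwise pay the NVMe read for the true top expert. */
int car_route(const float *scores,  /* router scores per expert */
              const int *cached,    /* 1 if expert resident in VRAM/DRAM */
              int n, float threshold) {
    int best = 0, best_cached = -1;
    for (int i = 1; i < n; i++)
        if (scores[i] > scores[best])
            best = i;
    for (int i = 0; i < n; i++)
        if (cached[i] && (best_cached < 0 || scores[i] > scores[best_cached]))
            best_cached = i;
    if (cached[best])
        return best;                /* top expert already resident: free win */
    if (best_cached >= 0 && scores[best] - scores[best_cached] <= threshold)
        return best_cached;         /* quality loss bounded by threshold */
    return best;                    /* no close substitute: pay the NVMe read */
}
```

The threshold trades a small, bounded routing quality loss for a large cut in random NVMe reads, which is exactly the perplexity-for-throughput trade reported below.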

This gets us to ~9 tok/s with only a ~3.5% perplexity degradation measured on wikitext.

The whole system is ~15K lines of Claude-driven C/HIP (with heavy human guidance).

https://preview.redd.it/d1th0dsbkvqg1.jpg?width=1280&format=pjpg&auto=webp&s=6bb456c55a762fc4e57b4313c887b9a5fe6ae582

submitted by /u/Rare-Tadpole-8841