Analysis of spilling MoE weights onto SSD: GLM-5 is surprisingly usable even with over 1/3rd of weights left on SSD, due to caching dynamics

Reddit r/LocalLLaMA / 4/12/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The article analyzes the feasibility of “spilling” Mixture-of-Experts (MoE) model weights from GPU/CPU memory onto SSD storage and finds that GLM-5 remains acceptably usable even when more than one-third of its weights reside on SSD.
  • It attributes this surprising usability primarily to caching dynamics: because frequently routed experts stay resident in faster memory, repeated access patterns hide much of the latency cost of SSD reads.
  • The discussion focuses on the performance and operability implications for local or resource-constrained environments where keeping all weights in memory is not possible.
  • It offers an empirical, technical perspective on how storage-hierarchy behavior (SSD versus faster tiers) shapes the practicality of MoE inference, rather than treating limited memory as a purely theoretical constraint.
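The caching dynamic the points above describe can be sketched with a toy LRU cache for expert weights. This is a hypothetical illustration, not the article's actual implementation: `ExpertCache`, its `load_fn` parameter, and the routing trace are all invented here to show why skewed expert routing lets most lookups hit RAM instead of SSD.

```python
from collections import OrderedDict

class ExpertCache:
    """Minimal LRU cache for MoE expert weights (illustrative sketch only).

    Experts missing from the in-memory cache are fetched via a
    user-supplied loader standing in for an SSD read; frequently routed
    ("hot") experts stay resident, so a skewed routing distribution means
    most lookups never touch storage.
    """
    def __init__(self, capacity, load_fn):
        self.capacity = capacity      # max experts held in RAM
        self.load_fn = load_fn        # slow path, stands in for an SSD read
        self.cache = OrderedDict()    # expert_id -> weights
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark most recently used
            self.hits += 1
            return self.cache[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)      # "SSD" fetch on cache miss
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return weights

# Simulate skewed routing: a handful of hot experts dominate the trace.
cache = ExpertCache(capacity=4, load_fn=lambda eid: f"weights[{eid}]")
routing = [0, 1, 0, 2, 1, 0, 3, 0, 1, 2, 0, 1]
for eid in routing:
    cache.get(eid)
print(cache.hits, cache.misses)  # prints: 8 4
```

With only the first access to each of the four experts missing, 8 of 12 lookups are served from memory; the same effect, at scale, is why leaving a third of the weights on SSD costs far less than a third of the throughput when routing is skewed.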