Built LazyMoE — run 120B LLMs on 8GB RAM with no GPU using lazy expert loading + TurboQuant

Reddit r/LocalLLaMA / 4/13/2026


Key Points

  • The author describes a self-built setup (“LazyMoE”) that reportedly allows running 120B-parameter LLMs on an 8GB RAM laptop with no GPU by combining multiple memory- and compute-reduction techniques.
  • The approach uses lazy Mixture-of-Experts (MoE) expert loading so only needed experts are loaded at runtime, reducing peak memory usage.
  • It also applies TurboQuant KV compression to shrink the key-value cache and further fit inference within limited RAM.
  • SSD streaming is used to handle parts of the model/data that cannot reside fully in memory, enabling execution despite storage/RAM constraints.
  • The post shares a GitHub repository and invites feedback, positioning the work as a practical system for “too big” models on commodity hardware.
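The lazy expert-loading idea in the bullets above can be pictured as an LRU cache over expert weights: only the experts the router actually selects get loaded from disk, and a small fixed number stay resident. This is a hypothetical sketch (the repo's actual loader may work differently); `LazyExpertCache` and `load_fn` are illustrative names, not from the project.

```python
from collections import OrderedDict
import numpy as np

class LazyExpertCache:
    """Keep at most `max_resident` experts in RAM; load the rest on demand.
    Illustrative sketch only — not the actual LazyMoE implementation."""
    def __init__(self, load_fn, max_resident=4):
        self.load_fn = load_fn          # loads one expert's weights from disk
        self.max_resident = max_resident
        self.cache = OrderedDict()      # expert_id -> weights, in LRU order

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as recently used
        else:
            if len(self.cache) >= self.max_resident:
                self.cache.popitem(last=False)  # evict least-recently-used
            self.cache[expert_id] = self.load_fn(expert_id)
        return self.cache[expert_id]

# Toy demo: 16 "experts" on disk, router picks 2 per token;
# peak resident memory is bounded by max_resident, not by expert count.
disk = {i: np.full((4, 4), i, dtype=np.float32) for i in range(16)}
cache = LazyExpertCache(disk.__getitem__, max_resident=4)
for token_experts in [(0, 3), (3, 7), (7, 12), (0, 12)]:
    for e in token_experts:
        _ = cache.get(e)
    assert len(cache.cache) <= 4
```

The point of the LRU policy is that expert selection in MoE models tends to be bursty, so recently used experts are likely to be reused and cheap to keep around.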

I'm a master's student in Germany and I got obsessed with one question:

Can you run a model that's "too big" for your hardware?

After weeks of experimenting, I combined three techniques — lazy MoE expert loading, TurboQuant KV compression, and SSD streaming — into a working system.
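KV-cache compression shrinks the other big memory consumer at inference time: the per-token key/value tensors. Below is a generic per-channel int8 quantize/dequantize sketch to show the idea — this is not the actual TurboQuant algorithm, just the simplest version of the trade it makes (4x less cache memory for a small reconstruction error).

```python
import numpy as np

def quantize_kv(kv, bits=8):
    """Symmetric per-channel quantization of a KV-cache block.
    Generic sketch — not the actual TurboQuant scheme."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(kv).max(axis=0, keepdims=True) / qmax  # one scale per channel
    scale[scale == 0] = 1.0                               # avoid divide-by-zero
    q = np.clip(np.round(kv / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((128, 64)).astype(np.float32)  # (tokens, head_dim)
q, scale = quantize_kv(kv)
recon = dequantize_kv(q, scale)

assert q.nbytes == kv.nbytes // 4          # int8 is 4x smaller than fp32
assert np.abs(recon - kv).max() < 0.05     # rounding error stays small
```

Real schemes typically go further (4-bit, grouping, outlier handling), but the memory accounting works the same way: cache size scales with bits per element.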

Here's what it looks like running on my Intel UHD 620 laptop with 8GB RAM and zero GPU...
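The SSD-streaming piece is what makes this viable on 8GB of RAM: weights that don't fit stay on disk and get paged in only when touched. A common way to get this behavior is memory-mapping the weight file, as in this minimal sketch (the repo may stream differently):

```python
import os
import tempfile
import numpy as np

# Write a "layer" of weights to disk, then memory-map it so the OS pages
# data in from the SSD only when a slice is actually accessed, instead of
# reading the whole array into RAM up front.
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".npy")
weights = np.arange(1024 * 256, dtype=np.float32).reshape(1024, 256)
np.save(tmp.name, weights)
tmp.close()

mapped = np.load(tmp.name, mmap_mode="r")  # no full read into memory
row = np.asarray(mapped[512])              # only this row is materialized
assert row[0] == 512 * 256

os.unlink(tmp.name)
```

The operating system's page cache then acts as a free LRU layer on top, which composes naturally with the lazy expert loading above.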

GitHub: https://github.com/patilyashvardhan2002-byte/lazy-moe

Would love feedback from this community!

submitted by /u/ReasonableRefuse4996