AI Navigate

I'm using llama.cpp to run models larger than my Mac's memory

Reddit r/LocalLLaMA / 3/22/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The post introduces Hypura, a llama.cpp-based method that optimizes inference by distributing model tensors across GPU, RAM, and NVMe tiers according to access patterns and bandwidth costs.
  • It is noted to work particularly well with MoE models, since not all experts need to be loaded into memory simultaneously; idle experts can be offloaded to NVMe.
  • Hypura is fully open source, with a GitHub repository provided for implementation and use.
  • The approach enables running models larger than a Mac's local memory capacity by leveraging tiered storage across the available hardware.

Hey all,

Wanted to share something that I hope can help others. I found a way to optimize llama.cpp inference specifically for running models that normally wouldn't fit locally due to insufficient memory. It's called Hypura, and it places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities.
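To make the idea concrete, here is a minimal sketch of what cost-based tier placement can look like: hot tensors go to the fastest tier with room, cold ones spill to slower tiers. All names, sizes, and the greedy strategy are hypothetical illustrations, not Hypura's actual algorithm.

```python
# Illustrative sketch only: greedy placement of model tensors across memory
# tiers by access frequency. Hypothetical, not Hypura's implementation.

def place_tensors(tensors, tiers):
    """tensors: list of (name, size, access_freq).
    tiers: list of (tier_name, capacity), ordered fastest to slowest."""
    placement = {}
    remaining = [cap for _, cap in tiers]
    # Most frequently accessed tensors claim the fastest tier first.
    for name, size, _freq in sorted(tensors, key=lambda t: -t[2]):
        for i, (tier_name, _cap) in enumerate(tiers):
            if remaining[i] >= size:
                placement[name] = tier_name
                remaining[i] -= size
                break
        else:
            raise MemoryError(f"no tier can hold {name}")
    return placement

tensors = [
    ("attn.0", 2, 100),   # touched every token -> hot
    ("expert.3", 4, 5),   # rarely-routed MoE expert -> cold
    ("expert.7", 4, 40),
]
tiers = [("gpu", 4), ("ram", 4), ("nvme", 100)]
print(place_tensors(tensors, tiers))
# → {'attn.0': 'gpu', 'expert.7': 'ram', 'expert.3': 'nvme'}
```

A real placer would also weigh per-tier bandwidth and transfer cost rather than only capacity, but the shape of the decision is the same.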

I've found it to work especially well with MoE models, since not all experts need to be loaded into memory at the same time; experts that aren't in use can be offloaded to NVMe.
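The reason MoE offloading pays off can be sketched with a toy expert cache: the router activates only a few experts per token, and routing is often skewed, so keeping recently used experts resident in fast memory serves most requests while idle experts sit on NVMe. This is a hypothetical illustration of the principle, not Hypura's code.

```python
from collections import OrderedDict

# Illustrative sketch only: an LRU cache of "resident" MoE experts.
# Misses stand in for slow loads from NVMe.

class ExpertCache:
    def __init__(self, capacity):
        self.capacity = capacity       # experts that fit in fast memory
        self.resident = OrderedDict()  # expert_id -> weights (stubbed)
        self.nvme_loads = 0            # simulated slow loads

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # recently used
        else:
            self.nvme_loads += 1                   # simulate NVMe read
            self.resident[expert_id] = f"weights-{expert_id}"
            if len(self.resident) > self.capacity:
                self.resident.popitem(last=False)  # evict coldest expert
        return self.resident[expert_id]

cache = ExpertCache(capacity=2)
for expert in [0, 1, 0, 0, 2, 0]:  # skewed routing: expert 0 is "hot"
    cache.get(expert)
print(cache.nvme_loads)
# → 3 (six activations, only three NVMe loads)
```

With skewed routing the hit rate climbs quickly, which is why only a fraction of the expert weights ever needs to occupy fast memory at once.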

Sharing the GitHub repo here. Completely OSS, and only possible because of llama.cpp: https://github.com/t8/hypura


submitted by /u/tbaumer22