PSA: Two env vars that stop your model server from eating all your RAM and getting OOM-killed

Reddit r/LocalLLaMA / 3/25/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • Model servers such as Ollama, vLLM, and TGI can gradually increase RSS over hours due to glibc heap fragmentation and memory not being returned to the OS, leading to OOM kills.
  • The proposed mitigation is to set two environment variables before process startup: `MALLOC_MMAP_THRESHOLD_=65536` and `MALLOC_TRIM_THRESHOLD_=65536`.
  • The post reports that testing on 13 diffusion models cycling continuously resulted in stable memory usage (~1.2GB) indefinitely, versus OOM at 52GB after 17 hours before the change.
  • A benchmark repo and full data/script are provided to reproduce and validate the memory behavior and fix.
  • This is an operational RAM-stability tweak for AI inference/service deployments rather than a change to model architecture or frameworks themselves.

If you run Ollama, vLLM, TGI, or any custom model server that loads and unloads models, you've probably seen RSS creep up over hours until Linux kills the process.

It's not a Python leak. It's not PyTorch. It's glibc's heap allocator fragmenting and never returning pages to the OS.
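For a process you can't relaunch with a fresh environment, glibc exposes the same two knobs at runtime through `mallopt(3)`. A minimal sketch via `ctypes` (Linux/glibc only; the constant values come from glibc's `<malloc.h>`, and ideally this runs early, before the process has allocated heavily):

```python
import ctypes
import ctypes.util

# Constants from glibc's <malloc.h>
M_TRIM_THRESHOLD = -1
M_MMAP_THRESHOLD = -3

# Load the C library (glibc only; no effect on musl or macOS)
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Runtime equivalents of MALLOC_MMAP_THRESHOLD_ / MALLOC_TRIM_THRESHOLD_:
# allocations >= 64 KiB go through mmap (returned to the OS on free), and
# the heap top is trimmed once 64 KiB of free space accumulates there.
# mallopt returns 1 on success.
assert libc.mallopt(M_MMAP_THRESHOLD, 65536) == 1
assert libc.mallopt(M_TRIM_THRESHOLD, 65536) == 1

# Optionally force an immediate release of free heap pages right now
libc.malloc_trim(0)
```

Note that setting either threshold (via env var or `mallopt`) also disables glibc's dynamic threshold adjustment, which is part of why the env-var route works.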

Fix:

export MALLOC_MMAP_THRESHOLD_=65536

export MALLOC_TRIM_THRESHOLD_=65536

Set these before your process starts. That's it.
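"Before your process starts" matters because glibc reads these variables at startup. If the server is spawned by a launcher rather than an interactive shell, inject them into the child's environment; a sketch of that, where the `ollama serve` command line is just a placeholder for whatever you actually run:

```python
import os
import subprocess

# Copy the current environment and add the two glibc tunables, so the
# child server process sees them from its very first allocation.
env = dict(os.environ,
           MALLOC_MMAP_THRESHOLD_="65536",
           MALLOC_TRIM_THRESHOLD_="65536")

# Placeholder command -- substitute your real server:
# subprocess.run(["ollama", "serve"], env=env)

# Sanity check: a child process really does inherit the variables
out = subprocess.run(["sh", "-c", "echo $MALLOC_TRIM_THRESHOLD_"],
                     env=env, capture_output=True, text=True)
print(out.stdout.strip())  # -> 65536
```

The same idea applies to systemd (`Environment=` lines in the unit) or Docker (`-e` flags): the variables just have to exist in the server's environment before its first malloc.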

We tested this on 13 diffusion models cycling continuously. Before: OOM at 52GB after 17 hours. After: stable at ~1.2GB indefinitely.
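To sanity-check the behavior on your own box without the full benchmark, a minimal sketch that reads RSS from `/proc` around a simulated load/unload cycle (Linux only; the 50 MiB buffer merely stands in for model weights, and this is not the repo's script):

```python
import re
from pathlib import Path

def rss_mib() -> float:
    """Current resident set size in MiB, from /proc/self/status (Linux only)."""
    text = Path("/proc/self/status").read_text()
    kb = int(re.search(r"^VmRSS:\s+(\d+)\s+kB", text, re.M).group(1))
    return kb / 1024

baseline = rss_mib()
blob = bytearray(50 * 1024 * 1024)  # simulate a model load (~50 MiB, zero-filled)
loaded = rss_mib()
del blob                            # simulate unload
print(f"baseline={baseline:.1f} MiB, loaded={loaded:.1f} MiB")
```

Logging this each load/unload cycle is enough to see the pattern the post describes: a healthy server plateaus, while a fragmenting one climbs monotonically until the OOM killer steps in.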

Repo with full data + benchmark script: https://github.com/brjen/pytorch-memory-fix

submitted by /u/VikingDane73