If you run Ollama, vLLM, TGI, or any custom model server that loads and unloads models, you've probably seen RSS creep up over hours until Linux kills the process.
It's not a Python leak. It's not PyTorch. It's glibc's heap allocator (ptmalloc) fragmenting: freed chunks get stranded below live allocations on the heap, so the allocator can never trim the top and return those pages to the OS.
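A quick way to confirm fragmentation (rather than a real leak) is to ask glibc to release its free pages directly and watch RSS. This sketch uses ctypes to call glibc's malloc_trim, so it's Linux/glibc-only; if RSS drops afterward, the memory was sitting free inside the allocator, not held by Python or PyTorch:

```python
import ctypes

# Linux/glibc only: load the C library and call malloc_trim(0), which asks
# the allocator to return as much free heap memory to the OS as it can.
libc = ctypes.CDLL("libc.so.6")
released = libc.malloc_trim(0)  # returns 1 if any memory was released, else 0
print("released pages:", bool(released))
```

Run this after unloading a model: a large RSS drop here is the fragmentation signature.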
Fix:
export MALLOC_MMAP_THRESHOLD_=65536
export MALLOC_TRIM_THRESHOLD_=65536
Set these in the environment before your process starts (glibc reads them once at startup, so setting them from inside a running process does nothing). That's it.
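One convenient way to guarantee the variables are in place at startup is a tiny wrapper script. This is a sketch (the script name and server command are placeholders, not from the repo): it exports the thresholds, then execs whatever command you pass it, so the server inherits the tuned environment.

```shell
#!/usr/bin/env sh
# Hypothetical wrapper (run_with_malloc_tuning.sh).
# Allocations >= 64 KiB go through mmap, so freeing them returns pages directly.
export MALLOC_MMAP_THRESHOLD_=65536
# Trim the heap back to the OS whenever >= 64 KiB sits free at the top.
export MALLOC_TRIM_THRESHOLD_=65536
# Replace this shell with the real server command, env vars intact.
exec "$@"
```

Usage: `./run_with_malloc_tuning.sh python -m your_server`, or the equivalent `Environment=` lines in a systemd unit.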
We tested this by cycling 13 diffusion models through load/unload continuously. Before: RSS grew until OOM at 52GB after 17 hours. After: stable at ~1.2GB indefinitely.
Repo with full data + benchmark script: https://github.com/brjen/pytorch-memory-fix