Hey everyone, I’ve been banging my head against the wall on this for a few weeks and could really use some architecture or MLOps advice.
I am building a unified Knowledge Graph / RAG service for a local coding agent. It runs in a single Docker container via FastAPI. Initially, it ran okay on Windows (WSL), but moving it to native Linux has exposed severe memory limit issues under stress tests.
Hardware Constraints:
• 8GB VRAM (Laptop GPU)
• ~16GB System RAM (Docker limits hit fast, usually only ~6GB free when models are loaded)
The Stack (The Models):
• Embedding: nomic-ai/nomic-embed-text-v2-moe
• Reranking: BAAI/bge-reranker-base
• Classification: MoritzLaurer/ModernBERT-large-zeroshot-v2.0 (used to classify text pairs into 4 relations: dependency, expansion, contradiction, unrelated)
The Problem / The Nightmare:
Because I am feeding both code chunks and natural-language text into these models, I cannot aggressively truncate the input. I need the models to handle long, variable-length sequences.
Here is what I’ve run into:
• Latency vs. OOM: If I use torch.cuda.empty_cache() to keep the GPU clean, latency spikes to 18-20 seconds per request due to driver syncs. If I remove it, the GPU instantly OOMs when concurrent requests hit.
• System RAM Explosion (Linux Exit 137): Using the Hugging Face pipeline("zero-shot-classification") caused massive CPU RAM bloat. Without truncation, the pipeline builds every text/label combination in host memory before sending anything to the GPU, and the kernel's OOM killer terminates the container almost immediately (exit code 137).
• VRAM Spikes: cudnn.benchmark = True was caching workspaces for every unique sequence length, draining my 3GB of free VRAM in seconds during stress tests.
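One mitigation that may help with both the RAM bloat and the per-length workspace caching is to pad inputs to a small set of fixed lengths and cap how many padded tokens go into one forward pass. The sketch below is pure Python and only plans the batches. The bucket sizes, the 4096-token budget, and the names bucket_length and micro_batches are all hypothetical and would need tuning against the real models.

```python
from typing import Iterator, List, Sequence

# Hypothetical bucket sizes: fewer distinct shapes means fewer cached
# workspaces and more reusable allocations. Tune to the real models.
BUCKETS = (64, 128, 256, 512, 1024, 2048)


def bucket_length(n_tokens: int, buckets: Sequence[int] = BUCKETS) -> int:
    """Round a token count up to the nearest bucket.

    Inputs longer than the largest bucket map to the largest bucket, so
    they would have to be chunked or truncated before inference.
    """
    for b in buckets:
        if n_tokens <= b:
            return b
    return buckets[-1]


def micro_batches(
    lengths: Sequence[int], max_padded_tokens: int = 4096
) -> Iterator[List[int]]:
    """Yield lists of item indices whose padded size fits the budget.

    Items are visited shortest first, so the newest item always has the
    largest bucket in the batch, and the padded cost of the batch is
    bucket * batch_size. A single oversized item still gets its own batch.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batch: List[int] = []
    for i in order:
        b = bucket_length(lengths[i])
        if batch and b * (len(batch) + 1) > max_padded_tokens:
            yield batch
            batch = []
        batch.append(i)
    if batch:
        yield batch
```

Each yielded index list would be tokenized with padding to its bucket length and run as a single forward pass, so the number of pairs resident in memory at once is bounded by the budget instead of by the request size.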
Current "Band-Aid" Implementation:
Right now, I have a pure Python/FastAPI setup. I bypassed the HF pipeline and wrote a manual NLI inference loop for ModernBERT. I am using asyncio.Lock() to force serial execution (only one model touches the GPU at a time) and using deterministic deallocation (del inputs + gc.collect()) via FastAPI background tasks.
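A minimal sketch of the serialization idea, for anyone who wants something concrete to poke at. It is an illustration under assumptions, not the exact code in the service: GpuGate and max_in_flight are hypothetical names, running the blocking call through asyncio.to_thread (Python 3.9+) is one possible choice, and the in-flight cap is an addition that is not part of the setup described above.

```python
import asyncio
from typing import Any, Callable


class GpuGate:
    """Let one blocking inference call run at a time and shed excess load.

    Create the gate inside the running event loop (for example in a
    FastAPI startup hook) so the lock binds to the right loop on older
    Python versions.
    """

    def __init__(self, max_in_flight: int = 8) -> None:
        self._lock = asyncio.Lock()
        self._in_flight = 0  # running + waiting requests
        self._max_in_flight = max_in_flight

    async def run(self, fn: Callable[..., Any], *args: Any) -> Any:
        # Reject early instead of letting queued inputs pile up in RAM.
        if self._in_flight >= self._max_in_flight:
            raise RuntimeError("GPU queue full")
        self._in_flight += 1
        try:
            async with self._lock:
                # The blocking model call runs in a worker thread, so the
                # event loop can keep accepting (or rejecting) requests.
                return await asyncio.to_thread(fn, *args)
        finally:
            self._in_flight -= 1
```

In an endpoint this would be called as `await gate.run(classify_pairs, batch)`, where classify_pairs stands in for whichever synchronous inference function is being protected, and the RuntimeError would be mapped to an HTTP 503 so the stress test sees fast rejections rather than a growing backlog.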
It's better, but still unstable under a 3-minute stress test.
My Questions for the Community:
Model Alternatives: Are there smaller/faster models that maintain high accuracy for Zero-Shot NLI and Reranking that fit better in an 8GB envelope?
Prebuilt Architectures: I previously looked at infinity_emb but struggled to integrate my custom 4-way NLI classification logic into its wrapper without double-loading models. Should I be looking at TEI (Text Embeddings Inference), TensorRT, or something else optimized for encoder models?
Serving Strategy: Is there a standard design pattern for hosting 3 transformer models on a single consumer GPU without them stepping on each other's memory?
Any suggestions on replacing the models, changing the inference engine, or restructuring the deployment to keep latency low while entirely preventing these memory crashes would be amazing. Thanks!