[Architecture Help] Serving Embed + Rerank + Zero-Shot Classifier on 8GB VRAM. Fighting System RAM Kills and Latency.

Reddit r/LocalLLaMA / 3/19/2026

💬 OpinionDeveloper Stack & InfrastructureTools & Practical Usage

共有:

Key Points

The author is building a unified Knowledge Graph / RAG service in a single Docker container with FastAPI on Linux, constrained by 8GB VRAM and ~16GB system RAM.
They are experiencing severe memory and latency issues under stress, including GPU OOM on concurrent requests, CPU RAM bloat from zero-shot classification, and VRAM caching spikes that drain available memory.
A band-aid approach using asyncio.Lock to serialize GPU access and deterministic deallocation improves stability but remains unstable under load.
They seek smaller/faster models for zero-shot NLI and reranking that still maintain accuracy within an 8GB envelope, and are exploring optimized architectures (TEI, TensorRT, or similar) for encoder models.
They want a standard serving design pattern to host three transformer models on a single consumer GPU without mutual memory interference, especially given the need to handle long, untruncated sequences.

Hey everyone, I’ve been banging my head against the wall on this for a few weeks and could really use some architecture or MLOps advice.

I am building a unified Knowledge Graph / RAG service for a local coding agent. It runs in a single Docker container via FastAPI. Initially, it ran okay on Windows (WSL), but moving it to native Linux has exposed severe memory limit issues under stress tests.

Hardware Constraints:

• 8GB VRAM (Laptop GPU)

• ~16GB System RAM (Docker limits hit fast, usually only ~6GB free when models are loaded)

The Stack (The Models):

Embedding: nomic-ai/nomic-embed-text-v2-moe
Reranking: BAAI/bge-reranker-base
Classification: MoritzLaurer/ModernBERT-large-zeroshot-v2.0 (used to classify text pairs into 4 relations: dependency, expansion, contradiction, unrelated).

The Problem / The Nightmare:

Because I am feeding code chunks and natural text into these models, I cannot aggressively truncate the text. I need the models to process variable, long sequences.

Here is what I’ve run into:

• Latency vs. OOM: If I use torch.cuda.empty_cache() to keep the GPU clean, latency spikes to 18-20 seconds per request due to driver syncs. If I remove it, the GPU instantly OOMs when concurrent requests hit.

• System RAM Explosion (Linux Exit 137): Using the Hugging Face pipeline("zero-shot-classification") caused massive CPU RAM bloat. Without truncation, the pipeline generates massive combination matrices in memory before sending them to the GPU. The Linux kernel instantly kills the container.

• VRAM Spikes: cudnn.benchmark = True was caching workspaces for every unique sequence length, draining my 3GB of free VRAM in seconds during stress tests.

Current "Band-Aid" Implementation:

Right now, I have a pure Python/FastAPI setup. I bypassed the HF pipeline and wrote a manual NLI inference loop for ModernBERT. I am using asyncio.Lock() to force serial execution (only one model touches the GPU at a time) and using deterministic deallocation (del inputs + gc.collect()) via FastAPI background tasks.

It's better, but still unstable under a 3-minute stress test.

My Questions for the Community:

Model Alternatives: Are there smaller/faster models that maintain high accuracy for Zero-Shot NLI and Reranking that fit better in an 8GB envelope?
Prebuilt Architectures: I previously looked at infinity_emb but struggled to integrate my custom 4-way NLI classification logic into its wrapper without double-loading models. Should I be looking at TEI (Text Generation Inference), TensorRT, or something else optimized for Encoder models?
Serving Strategy: Is there a standard design pattern for hosting 3 transformer models on a single consumer GPU without them stepping on each other's memory?

Any suggestions on replacing the models, changing the inference engine, or restructuring the deployment to keep latency low while entirely preventing these memory crashes would be amazing. Thanks!

submitted by /u/CourtAdventurous_1
[link] [comments]

Manus、AIエージェントをデスクトップ化ローカルPC上でファイルやアプリを直接操作可能にのサムネイル画像

Ledge.ai

The programming passion is melting

Dev.to

Best AI Tools for Property Managers in 2026

Dev.to

Building “The Sentinel” – AI Parametric Insurance at Guidewire DEVTrails

Dev.to

Maximize Developer Revenue with Monetzly's Innovative API for AI Conversations

Dev.to

[Architecture Help] Serving Embed + Rerank + Zero-Shot Classifier on 8GB VRAM. Fighting System RAM Kills and Latency.

Key Points

Related Articles

Manus、AIエージェントをデスクトップ化ローカルPC上でファイルやアプリを直接操作可能にのサムネイル画像

The programming passion is melting

Best AI Tools for Property Managers in 2026

Building “The Sentinel” – AI Parametric Insurance at Guidewire DEVTrails

Maximize Developer Revenue with Monetzly's Innovative API for AI Conversations

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer

Key Points

Related Articles

Manus、AIエージェントをデスクトップ化 ローカルPC上でファイルやアプリを直接操作可能にのサムネイル画像

The programming passion is melting

Best AI Tools for Property Managers in 2026

Building “The Sentinel” – AI Parametric Insurance at Guidewire DEVTrails

Maximize Developer Revenue with Monetzly's Innovative API for AI Conversations

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer

Manus、AIエージェントをデスクトップ化ローカルPC上でファイルやアプリを直接操作可能にのサムネイル画像