Update on 12x32GB SXM V100 cluster / local AI for legal drafting

Reddit r/LocalLLaMA / 5/26/2026

Key Points

A lawyer running local AI for legal drafting reports the final V100 server build: twelve V100-SXM2 32GB GPUs on a Threadripper Pro, carefully grouped per NVLink board to avoid large performance drops from cross-board PCIe/NUMA hops.
After hardware and software iterations, the setup switched from vLLM to llama.cpp: the models the author wants are MoE GGUFs, and vLLM's quantization paths (FP8, AWQ, Marlin, GPTQ) are unsupported or broken on Volta. A recent mainline llama.cpp build also fixed a Gemma chat-parser bug that had been mangling long prompts.
Decode tests on one four-V100 board (Q8 GGUF, Q4 KV cache, flash attention on) show MoE models reaching practical speeds (~113 tok/s for Gemma-4-26B-A4B, ~82 tok/s for Qwen3.6-35B-A3B, ~50 tok/s for Qwen3.5-122B-A10B), while dense 27-32B models manage only ~20-28 tok/s and a dense ~128B model about 9 tok/s.
The results led to a strategic pivot: instead of chasing large dense weights, the system design now focuses on MoE architectures to maintain usable throughput across long-context drafting prompts (over 25k tokens).

Update from the lawyer with the V100 server. A few of you asked what I actually ended up running once the dust settled, so here it is. Still just a lawyer, still driving the whole thing through Claude Code, still not fully sure what I'm doing — but it works now, which is more than I could say last time.

First, the hardware caught up to the plan. The last two V100s are in, so the "final form" I promised is real: twelve V100-SXM2 32GB on the Threadripper Pro. It's Board A on GPUs {4,5,8,9}, Board B on {6,7,10,11}, an NVLink pair on {0,1}, and a mixed pair on {2,3} where one card is a 16GB. Split a model across two different NVLink boards and throughput falls off a cliff (the cross-board hop is PCIe/NUMA, not NVLink), so I keep every model inside one board. Learned that one the expensive way.
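The "keep every model inside one board" rule can be enforced mechanically before anything is launched. The following is a minimal sketch, not the author's tooling: the GPU indices are copied from the post, while the group names and both function names are made up for illustration. It also assumes the indices refer to PCI bus order, which CUDA only guarantees when `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set.

```python
# GPU indices per NVLink group, copied from the post. Group names are illustrative.
NVLINK_GROUPS = {
    "board_a": {4, 5, 8, 9},
    "board_b": {6, 7, 10, 11},
    "pair_01": {0, 1},
    "pair_23": {2, 3},  # mixed pair; the post says one of these cards is 16GB
}


def group_for(gpus):
    """Return the single NVLink group that contains every GPU in `gpus`.

    Returns None when the set is empty or spans more than one group.
    """
    wanted = set(gpus)
    for name, members in NVLINK_GROUPS.items():
        if wanted and wanted <= members:
            return name
    return None


def cuda_visible_devices(gpus):
    """Build a CUDA_VISIBLE_DEVICES value, refusing cross-group placements."""
    if group_for(gpus) is None:
        raise ValueError(f"GPUs {sorted(gpus)} do not fit inside one NVLink group")
    return ",".join(str(g) for g in sorted(gpus))
```

Calling `cuda_visible_devices({4, 5, 8, 9})` yields a value for Board A, while a request such as `{4, 6}` raises instead of silently routing traffic over the PCIe/NUMA hop.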

And yeah, I caved and built the second box. EPYC 7302P, 512GB RAM, 4x RTX 3090 + 2x V100-PCIe. The mid-life crisis remains on schedule.

The bigger change: I gave up on vLLM for the local models. Not because vLLM is bad, but because the models I actually want are MoE GGUFs, and vLLM on Volta is a dead end for those (FP8/AWQ/Marlin all want SM75+, and the GPTQ kernels are broken on compute capability 7.0). I moved the whole thing to mainline llama.cpp; a recent build finally fixed a Gemma chat-parser bug that had been mangling my long prompts.

Here's the part that's the opposite of what my first post implied: on V100, dense models are a trap. Only MoE clears a usable speed. Rough decode numbers — Q8 GGUF, Q4 KV cache, flash-attn on, one 4-card board, on real drafting prompts (several thousand tokens of context, not a 5-token "hello"):

| Model | Type | Decode (tok/s) | Notes |
|---|---|---|---|
| Gemma-4-26B-A4B | MoE | ~113 | |
| Qwen3.6-35B-A3B | MoE | ~82 | |
| Qwen3.5-122B-A10B | MoE | ~50 | |
| any dense 27-32B | dense | ~20-28 | under my 40 floor, not worth it |
| dense ~128B | dense | ~9 | forget it |
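For readers who want to reproduce the test conditions (Q8 GGUF, Q4 KV cache, flash attention on, one 4-card board), here is a hedged sketch of how a `llama-server` launch could be assembled. The post does not give the actual command line, so the model path, context size, and port are placeholders, and `q4_0` is an assumption about which Q4 cache type was used. Flag spellings follow recent llama.cpp builds; older builds treat `--flash-attn` as a bare switch, so check `llama-server --help` for the build in use.

```python
import os


def llama_server_command(model_path, gpus, ctx_tokens, port):
    """Return (env, argv) for one llama-server instance pinned to `gpus`."""
    env = dict(os.environ)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # keep indices aligned with nvidia-smi
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    argv = [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", "999",   # offload every layer to the GPUs
        "--ctx-size", str(ctx_tokens),
        "--flash-attn", "on",      # bare `--flash-attn` on older builds
        "--cache-type-k", "q4_0",  # assumed Q4 variant for the K cache
        "--cache-type-v", "q4_0",  # assumed Q4 variant for the V cache
        "--port", str(port),
    ]
    return env, argv


# Hypothetical values: Board A from the post, a placeholder model file and port.
env, argv = llama_server_command(
    "/models/placeholder-q8_0.gguf", [4, 5, 8, 9], 32768, 8080
)
```

The pair can be handed to `subprocess.Popen(argv, env=env)`. Keeping the GPU pinning in the environment, not in the arguments, means the server only ever sees the four cards on its own board.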

So a 122B/10B-active reasoning model runs at ~50 tok/s on four V100s — faster than the dense 32B managed on vLLM in my first post — and it holds that at long context (I've pushed Gemma past 25k tokens without it falling apart, where the dense models choked). That reframed everything: I stopped chasing big dense weights and built the system around MoE.
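One common way to see why active parameters matter more than total parameters is a memory-bandwidth ceiling: at batch size 1, each decoded token has to read every active weight once. The sketch below is a rough back-of-envelope model, not a measurement and not the author's analysis. It assumes the nominal 900 GB/s HBM2 bandwidth of a V100-SXM2, 8.5 bits per weight for Q8_0, active-parameter counts taken from the model names, and layers split across cards so that only one card's bandwidth is in play at a time. It ignores KV-cache reads, inter-GPU transfers, and kernel overhead, so real numbers land below it.

```python
V100_BANDWIDTH_GB_S = 900.0  # nominal HBM2 bandwidth of one V100-SXM2
Q8_0_BITS_PER_WEIGHT = 8.5   # 32 int8 weights plus one fp16 scale per block


def decode_ceiling_tok_s(active_params_billion):
    """Upper bound on decode speed if weight reads were the only cost."""
    gb_read_per_token = active_params_billion * Q8_0_BITS_PER_WEIGHT / 8.0
    return V100_BANDWIDTH_GB_S / gb_read_per_token


# Active parameters in billions; MoE values are read off the "A4B"-style suffixes.
ACTIVE_PARAMS = {
    "Gemma-4-26B-A4B": 4,
    "Qwen3.6-35B-A3B": 3,
    "Qwen3.5-122B-A10B": 10,
    "dense 32B": 32,
    "dense 128B": 128,
}

ceilings = {name: decode_ceiling_tok_s(b) for name, b in ACTIVE_PARAMS.items()}
```

Under these assumptions the 122B-A10B model reads roughly a third of the bytes per token that a dense 32B does, which is consistent with it decoding faster despite having about four times the total weights.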

What's actually running (the stack you asked for):
It isn't one model answering chat — it's an orchestrator that routes a legal task across several local models, each pinned to its own board so they don't fight over GPUs. When it runs the heaviest job (a full affidavit or motion, intake-to-document), it lights up 16 GPUs across both boxes:

- Workhorse drafting — Qwen3.6-35B-A3B on Board A {4,5,8,9}
- Heavy reasoning + high-stakes drafting — Qwen3.5-122B-A10B on Board B {6,7,10,11}
- A small "does this even have grounds" gate model on the {0,1} pair
- An adversarial reviewer whose entire job is to attack my own draft, on the {2,3} pair
- Gemma-4-26B for financial/extraction + a small Qwen as the router, on the 3090s on the second box via Ollama

It's a sequential pipeline so they don't all hammer at once, but all 16 stay resident. Lighter work uses far less — combining and Bates-stamping exhibits is pure CPU (PyMuPDF + Tesseract, no GPU at all); a plain summary mostly just hits Gemma and the router.
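The placement described above can be written down as data and sanity-checked. This is an illustrative sketch only: the roles, models, and V100 GPU sets come from the post, but the box names, the 3090 indices, the stage order, and every function name are assumptions, and the real orchestrator is certainly more involved than a straight chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    role: str
    model: str
    box: str         # hypothetical host label
    gpus: frozenset  # GPU indices on that host


# Stage order is assumed; the post only says the pipeline is sequential.
PIPELINE = [
    Stage("router", "small Qwen", "3090-box", frozenset({0, 1, 2, 3})),
    Stage("gate", "small gate model", "v100-box", frozenset({0, 1})),
    Stage("financial/extraction", "Gemma-4-26B", "3090-box", frozenset({0, 1, 2, 3})),
    Stage("workhorse drafting", "Qwen3.6-35B-A3B", "v100-box", frozenset({4, 5, 8, 9})),
    Stage("heavy reasoning", "Qwen3.5-122B-A10B", "v100-box", frozenset({6, 7, 10, 11})),
    Stage("adversarial review", "reviewer model", "v100-box", frozenset({2, 3})),
]


def resident_gpus(stages):
    """Distinct (box, gpu) pairs that hold a loaded model."""
    return {(s.box, g) for s in stages for g in s.gpus}


def shared_gpu_pairs(stages):
    """Pairs of roles that sit on overlapping GPUs of the same box."""
    pairs = []
    for i, a in enumerate(stages):
        for b in stages[i + 1:]:
            if a.box == b.box and a.gpus & b.gpus:
                pairs.append((a.role, b.role))
    return pairs


def run_sequential(stages, task, call_model):
    """Run one stage at a time; `call_model(stage, text)` is caller-supplied."""
    text = task
    for stage in stages:
        text = call_model(stage, text)
    return text
```

With this layout `len(resident_gpus(PIPELINE))` comes to 16 (twelve V100s plus four 3090s), and the only stages sharing cards are the router and the extraction model on the second box.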

The honest part, since this sub kept me honest last time:
- The local models hallucinate citations and dates. Confidently. I had to build a verifier that checks every cite, date, and Bates number in a draft against the actual source material and blocks anything it can't ground, on top of the adversarial reviewer. Local drafting is bimodal — sometimes it correctly refuses to invent, sometimes it fabricates a whole dated chronology and swears in the same breath that it invented nothing. It does not touch a final document without that gate and without me.
- The dumbest bug I found: my own pipeline was ~79% poisoned. The thing that builds the evidence bundle was scooping up its OWN prior outputs as if they were client evidence, so the models were "grounding" on slop they'd written earlier — at one point it cited an RTX 3060 as a Bates number, which, fair. Fixed the builder to stop eating its own tail and scrubbed it out. If you run any RAG/agent pipeline, go look at what's literally in your context window — mine was a hall of mirrors and I had no idea.
- I also made it refuse to quietly fall back to a cloud model when I tell it to run local-only. If it can't do a step locally it says so, by name, instead of phoning Anthropic behind my back.
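Two of the guards above lend themselves to a small sketch: keeping the pipeline's own outputs out of the evidence bundle, and blocking any draft that contains a date or Bates number not found in the source material. This is not the author's verifier. The function names, the output-directory convention, the ISO date format, and the Bates pattern (letters followed by six digits) are all assumptions, and checking legal citations would need a separate, more careful matcher.

```python
import re
from pathlib import Path

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # assumed ISO-style dates
BATES_RE = re.compile(r"\b[A-Z]{2,6}\d{6}\b")   # assumed Bates format, e.g. SMITH000123


def client_evidence_only(paths, output_dir):
    """Drop any path that is, or lives under, the pipeline's output directory."""
    out = Path(output_dir).resolve()
    kept = []
    for p in paths:
        resolved = Path(p).resolve()
        if resolved != out and out not in resolved.parents:
            kept.append(p)
    return kept


def ungrounded_items(draft, sources):
    """Return dates and Bates numbers in `draft` that appear in no source text."""
    corpus = "\n".join(sources)
    found = set(DATE_RE.findall(draft)) | set(BATES_RE.findall(draft))
    return sorted(item for item in found if item not in corpus)


def gate(draft, sources):
    """Pass the draft through only if every extracted item can be grounded."""
    missing = ungrounded_items(draft, sources)
    if missing:
        raise ValueError(f"blocked, cannot ground: {missing}")
    return draft
```

The substring check is deliberately strict and simple: anything the gate cannot find verbatim in the sources is treated as ungrounded and left for a human to resolve.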

Still want the exact thing I wanted in the first post — a model that writes like me and handles the boring form-filling and pattern stuff. I'm closer: the system now captures my edits as correction data, which is the start of a real fine-tune set. Haven't pulled the QLoRA trigger yet. So the same questions stand, and I'd genuinely take advice:
- For QLoRA on this hardware (V100, no bf16, no FA2): do you reach for a 35B-A3B MoE base, or am I smarter to fine-tune a dense ~14B I can actually train and keep the MoE for the heavy serving?
- Anyone serving MoE on Volta found anything faster than llama.cpp — ik_llama, something else? And is there a better long-context KV story than Q4?
- Am I an idiot keeping 122B-A10B around at 50 tok/s when I could just run the 35B for everything?

Tell me what I'm doing wrong.

submitted by /u/TumbleweedNew6515