Update on 12x32gb sxm v100 cluster / local AI for legal drafting

Reddit r/LocalLLaMA / 5/26/2026

Key Points

  • A lawyer running local AI for legal drafting reports the final V100 server build: twelve V100-SXM2 GPUs (32GB each, apart from one 16GB card in a mixed pair) on a Threadripper Pro, with every model kept inside a single NVLink board because cross-board PCIe/NUMA hops cause a large throughput drop.
  • After hardware and software iterations, the setup switched from vLLM to llama.cpp: the desired models are MoE GGUFs, the quantized kernel paths vLLM would need are unsupported or broken on Volta, and a recent llama.cpp build also fixed a Gemma chat-parser bug.
  • Decode tests on one four-card V100 board show MoE models reaching practical speeds (~113 tok/s for Gemma-4-26B-A4B, ~82 tok/s for Qwen3.6-35B-A3B, ~50 tok/s for Qwen3.5-122B-A10B), while dense 27–32B models manage only ~20–28 tok/s and a dense ~128B model is nearly unusable at ~9 tok/s.
  • The results led to a strategic pivot: instead of chasing large dense weights, the system design now focuses on MoE architectures to maintain usable throughput across long-context drafting prompts (over 25k tokens).

Update from the lawyer with the V100 server. A few of you asked what I actually ended up running once the dust settled, so here it is. Still just a lawyer, still driving the whole thing through Claude Code, still not fully sure what I'm doing — but it works now, which is more than I could say last time.

First, the hardware caught up to the plan. The last two V100s are in, so the "final form" I promised is real: twelve V100-SXM2 cards on the Threadripper Pro, all 32GB except one 16GB. The layout is Board A on GPUs {4,5,8,9}, Board B on {6,7,10,11}, an NVLink pair on {0,1}, and a mixed pair on {2,3} (that's where the 16GB card lives). Split a model across two different NVLink boards and throughput falls off a cliff (the cross-board hop is PCIe/NUMA, not NVLink), so I keep every model inside one board. Learned that one the expensive way.
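A minimal sketch of the one-board rule, assuming the GPU indices in the post are the same indices the CUDA runtime sees. The `BOARDS` dict and both helper names are hypothetical; only the `CUDA_VISIBLE_DEVICES` environment variable is a real CUDA feature.

```python
# Board layout as described in the post. The dict and helper names are
# illustrative, not part of any real tool.
BOARDS = {
    "board_a": {4, 5, 8, 9},
    "board_b": {6, 7, 10, 11},
    "nvlink_pair": {0, 1},
    "mixed_pair": {2, 3},
}


def board_for(gpus):
    """Return the name of the single board that contains every GPU, else None."""
    wanted = set(gpus)
    for name, members in BOARDS.items():
        if wanted and wanted <= members:
            return name
    return None


def visible_devices(gpus):
    """Build a CUDA_VISIBLE_DEVICES value, refusing any cross-board split."""
    if board_for(gpus) is None:
        raise ValueError(f"GPUs {sorted(gpus)} do not fit inside one board")
    return ",".join(str(g) for g in sorted(gpus))


print(visible_devices([4, 5, 8, 9]))
```

One caveat worth checking on any multi-GPU box: by default the CUDA runtime may order devices differently from `nvidia-smi`. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` makes the two agree.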

And yeah, I caved and built the second box. EPYC 7302P, 512GB RAM, 4x RTX 3090 + 2x V100-PCIe. The mid-life crisis remains on schedule.

The bigger change: I gave up on vLLM for the local models. Not because vLLM is bad, but because the models I actually want are MoE GGUFs, and vLLM on Volta is a dead end for those (FP8/AWQ/Marlin all want SM75 or newer, and the GPTQ kernels are broken on SM70). I moved the whole thing to llama.cpp (mainline; a recent build finally fixed a Gemma chat-parser bug that had been mangling my long prompts).
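The compute-capability argument can be written as a simple comparison. This is a sketch only: V100 being compute capability 7.0 and RTX 3090 being 8.6 are hardware facts, but the floor below is just the post's "SM75+" figure, and individual kernels may require a newer architecture than that.

```python
# Compute capability as (major, minor).
V100 = (7, 0)      # Volta
RTX_3090 = (8, 6)  # Ampere

# Floor taken from the post's "SM75+" remark. Assumption: treated here as a
# single threshold, although specific kernels may need more.
POST_QUANT_FLOOR = (7, 5)


def meets_floor(device_cc, required_cc):
    """True if the device's compute capability is at least the required one."""
    return tuple(device_cc) >= tuple(required_cc)


for name, cc in [("V100", V100), ("RTX 3090", RTX_3090)]:
    print(name, meets_floor(cc, POST_QUANT_FLOOR))
```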

Here's the part that's the opposite of what my first post implied: on V100, dense models are a trap. Only MoE clears a usable speed. Rough decode numbers — Q8 GGUF, Q4 KV cache, flash-attn on, one 4-card board, on real drafting prompts (several thousand tokens of context, not a 5-token "hello"):

| Model | Type | tok/s (decode) |
|---|---|---|
| Gemma-4-26B-A4B | MoE | ~113 |
| Qwen3.6-35B-A3B | MoE | ~82 |
| Qwen3.5-122B-A10B | MoE | ~50 |
| any dense 27-32B | dense | ~20-28 (under my 40 floor, not worth it) |
| dense ~128B | dense | ~9 (forget it) |
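For anyone trying to reproduce these conditions, here is a hedged sketch of how the stated settings (Q4 KV cache, flash-attn on, one 4-card board) might map onto a `llama-server` launch. The model path, context size, and port are placeholders, not values from the post. The flag spellings match recent mainline llama.cpp builds as far as I know; older builds take `-fa` as a bare switch, so check `llama-server --help` on your build.

```python
import shlex


def llama_server_cmd(model_path, gpus, ctx=32768, port=8080):
    """Return (env, argv) for a llama-server launch pinned to the given GPUs."""
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpus)}
    argv = [
        "llama-server",
        "-m", model_path,          # placeholder path to a Q8 GGUF
        "-ngl", "99",              # offload all layers to the GPUs
        "-c", str(ctx),            # context size: an assumption, not from the post
        "--flash-attn", "on",      # a quantized V cache needs flash attention
        "--cache-type-k", "q4_0",
        "--cache-type-v", "q4_0",
        "--port", str(port),
    ]
    return env, argv


env, argv = llama_server_cmd("/models/example-q8_0.gguf", [4, 5, 8, 9])
print(env["CUDA_VISIBLE_DEVICES"], shlex.join(argv))
```

The function only builds the command; passing `argv` and `env` to `subprocess.Popen` is left out so the sketch runs without llama.cpp installed.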

So a 122B/10B-active reasoning model runs at ~50 tok/s on four V100s — faster than the dense 32B managed on vLLM in my first post — and it holds that at long context (I've pushed Gemma past 25k tokens without it falling apart, where the dense models choked). That reframed everything: I stopped chasing big dense weights and built the system around MoE.
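To put the decode numbers in wall-clock terms, here is a small arithmetic sketch. The rates are the ones reported above (using the low end of the dense 27-32B range); the 2,000-token draft length is a made-up example, and prompt processing time is ignored.

```python
# Decode rates (tok/s) as reported in the post.
DECODE_TOK_S = {
    "Gemma-4-26B-A4B": 113,
    "Qwen3.6-35B-A3B": 82,
    "Qwen3.5-122B-A10B": 50,
    "dense 27-32B (low end)": 20,
    "dense ~128B": 9,
}
FLOOR_TOK_S = 40       # the post's usability floor
DRAFT_TOKENS = 2000    # hypothetical draft length


def decode_seconds(n_tokens, tok_s):
    """Seconds to generate n_tokens at a steady decode rate."""
    return n_tokens / tok_s


def usable(tok_s, floor=FLOOR_TOK_S):
    """True if the decode rate clears the usability floor."""
    return tok_s >= floor


for name, rate in DECODE_TOK_S.items():
    secs = decode_seconds(DRAFT_TOKENS, rate)
    print(f"{name}: {secs:.0f}s for {DRAFT_TOKENS} tokens, usable={usable(rate)}")
```

At 50 tok/s the hypothetical draft takes 40 seconds; at 9 tok/s it takes close to four minutes.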

What's actually running (the stack you asked for):
It isn't one model answering chat — it's an orchestrator that routes a legal task across several local models, each pinned to its own board so they don't fight over GPUs. When it runs the heaviest job (a full affidavit or motion, intake-to-document), it lights up 16 GPUs across both boxes:

- Workhorse drafting — Qwen3.6-35B-A3B on Board A {4,5,8,9}
- Heavy reasoning + high-stakes drafting — Qwen3.5-122B-A10B on Board B {6,7,10,11}
- A small "does this even have grounds" gate model on the {0,1} pair
- An adversarial reviewer whose entire job is to attack my own draft, on the {2,3} pair
- Gemma-4-26B for financial/extraction + a small Qwen as the router, on the 3090s on the second box via Ollama

It's a sequential pipeline so they don't all hammer at once, but all 16 stay resident. Lighter work uses far less — combining and Bates-stamping exhibits is pure CPU (PyMuPDF + Tesseract, no GPU at all); a plain summary mostly just hits Gemma and the router.
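The routing described above could be sketched roughly like this. Everything here is illustrative: the stage order, the `Stage` class, and the stub lambdas standing in for model calls are assumptions, not the actual orchestrator. The point is only the shape: stages run one at a time, each is tied to its own device group, and an early gate can stop the run.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Stage:
    name: str
    device_group: str                      # where the model is pinned
    run: Callable[[str], Optional[str]]    # return None to stop the pipeline


def run_pipeline(stages, task):
    """Run stages in order; return (final_output_or_None, names_of_stages_run)."""
    trace = []
    current = task
    for stage in stages:
        result = stage.run(current)
        trace.append(stage.name)
        if result is None:
            return None, trace
        current = result
    return current, trace


# Stub stages: real ones would call the model served on that device group.
stages = [
    Stage("gate", "pair {0,1}", lambda t: t if "grounds:" in t else None),
    Stage("draft", "Board A {4,5,8,9}", lambda t: f"DRAFT({t})"),
    Stage("review", "pair {2,3}", lambda t: f"REVIEWED({t})"),
]

print(run_pipeline(stages, "grounds: unpaid invoice"))
print(run_pipeline(stages, "no basis stated"))
```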

The honest part, since this sub kept me honest last time:
- The local models hallucinate citations and dates. Confidently. I had to build a verifier that checks every cite, date, and Bates number in a draft against the actual source material and blocks anything it can't ground, on top of the adversarial reviewer. Local drafting is bimodal — sometimes it correctly refuses to invent, sometimes it fabricates a whole dated chronology and swears in the same breath that it invented nothing. It does not touch a final document without that gate and without me.
- The dumbest bug I found: my own pipeline was ~79% poisoned. The thing that builds the evidence bundle was scooping up its OWN prior outputs as if they were client evidence, so the models were "grounding" on slop they'd written earlier — at one point it cited an RTX 3060 as a Bates number, which, fair. Fixed the builder to stop eating its own tail and scrubbed it out. If you run any RAG/agent pipeline, go look at what's literally in your context window — mine was a hall of mirrors and I had no idea.
- I also made it refuse to quietly fall back to a cloud model when I tell it to run local-only. If it can't do a step locally it says so, by name, instead of phoning Anthropic behind my back.
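A minimal sketch of the two checks described in the list above: grounding every Bates number and date in a draft against source text, and keeping the pipeline's own outputs out of the evidence bundle. The regex formats, the `origin` field, and all function names are assumptions for illustration; real Bates schemes and date formats vary, and a real verifier would need to handle citations too.

```python
import re

# Hypothetical formats: Bates numbers like "ABC000123", ISO dates only.
BATES_RE = re.compile(r"\b[A-Z]{2,6}\d{4,8}\b")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_claims(text):
    """Collect every Bates number and date that appears in the text."""
    return set(BATES_RE.findall(text)) | set(DATE_RE.findall(text))


def ungrounded(draft, sources):
    """Return claims in the draft that appear in none of the source texts."""
    known = set()
    for source in sources:
        known |= extract_claims(source)
    return sorted(extract_claims(draft) - known)


def client_evidence_only(docs):
    """Drop anything the pipeline produced itself (assumes an 'origin' tag)."""
    return [d for d in docs if d.get("origin") == "client"]


docs = [
    {"origin": "client", "text": "Invoice ABC000123 issued 2024-03-01."},
    {"origin": "pipeline", "text": "Prior draft citing ABC000999 on 2024-05-09."},
]
sources = [d["text"] for d in client_evidence_only(docs)]
draft = "See ABC000123, dated 2024-03-01, and the notice of 2024-05-09."
print(ungrounded(draft, sources))
```

With the pipeline's earlier draft filtered out of the bundle, the second date has nothing to ground on and gets flagged. Leave that document in and the same date would pass, which is the hall-of-mirrors failure in miniature.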

Still want the exact thing I wanted in the first post — a model that writes like me and handles the boring form-filling and pattern stuff. I'm closer: the system now captures my edits as correction data, which is the start of a real fine-tune set. Haven't pulled the QLoRA trigger yet. So the same questions stand, and I'd genuinely take advice:
- For QLoRA on this hardware (V100, no bf16, no FA2): do you reach for a 35B-A3B MoE base, or am I smarter to fine-tune a dense ~14B I can actually train and keep the MoE for the heavy serving?
- Anyone serving MoE on Volta found anything faster than llama.cpp — ik_llama, something else? And is there a better long-context KV story than Q4?
- Am I an idiot keeping 122B-A10B around at 50 tok/s when I could just run the 35B for everything?
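On capturing edits as correction data: one possible record shape, sketched with the standard library only. The field names (`prompt`, `rejected`, `chosen`) are a hypothetical convention, not what the system in this post actually stores, and the similarity score is just a cheap way to see how heavy each edit was.

```python
import difflib
import json


def correction_record(prompt, draft, edited):
    """Pair a model draft with the human-edited version; None if nothing changed."""
    if draft == edited:
        return None
    similarity = difflib.SequenceMatcher(None, draft, edited).ratio()
    return {
        "prompt": prompt,
        "rejected": draft,     # what the model wrote
        "chosen": edited,      # what the human kept
        "similarity": round(similarity, 3),
    }


def to_jsonl(records):
    """Serialize non-empty records, one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records if r)


records = [
    correction_record("Draft a notice of hearing.", "The hearing are set.", "The hearing is set."),
    correction_record("Summarize exhibit 4.", "Same text.", "Same text."),
]
print(to_jsonl(records))
```

Unedited drafts are skipped here because they carry no correction signal. Whether to keep them as positive examples is a design choice for the fine-tune set.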

Tell me what I'm doing wrong.

submitted by /u/TumbleweedNew6515