Update from the lawyer with the V100 server. A few of you asked what I actually ended up running once the dust settled, so here it is. Still just a lawyer, still driving the whole thing through Claude Code, still not fully sure what I'm doing, but it works now, which is more than I could say last time.

First, the hardware caught up to the plan. The last two V100s are in, so the "final form" I promised is real: twelve V100-SXM2 32GB on the Threadripper Pro. It's Board A on GPUs {4,5,8,9}, Board B on {6,7,10,11}, an NVLink pair on {0,1}, and a mixed pair on {2,3} where one card is a 16GB. Split a model across two different NVLink boards and throughput falls off a cliff (the cross-board hop is PCIe/NUMA, not NVLink), so I keep every model inside one board. Learned that one the expensive way.

And yeah, I caved and built the second box. EPYC 7302P, 512GB RAM, 4x RTX 3090 + 2x V100-PCIe. The mid-life crisis remains on schedule.

The bigger change: I gave up on vLLM for the local models. Not because vLLM is bad, but because the models I actually want are MoE GGUFs, and vLLM on Volta is a dead end for those (FP8/AWQ/Marlin all want SM75+, the GPTQ kernels are broken on 7.0). I moved the whole thing to llama.cpp (mainline; a recent build finally fixed a Gemma chat-parser bug that had been mangling my long prompts).

Here's the part that's the opposite of what my first post implied: on V100, dense models are a trap. Only MoE clears a usable speed. Rough decode numbers (Q8 GGUF, Q4 KV cache, flash-attn on, one 4-card board, on real drafting prompts with several thousand tokens of context, not a 5-token "hello"):
So a 122B/10B-active reasoning model runs at ~50 tok/s on four V100s, faster than the dense 32B managed on vLLM in my first post, and it holds that at long context (I've pushed Gemma past 25k tokens without it falling apart, where the dense models choked). That reframed everything: I stopped chasing big dense weights and built the system around MoE.

What's actually running (the stack you asked for):

- Workhorse drafting: Qwen3.6-35B-A3B on Board A {4,5,8,9}

It's a sequential pipeline so they don't all hammer at once, but all 16 stay resident. Lighter work uses far less: combining and Bates-stamping exhibits is pure CPU (PyMuPDF + Tesseract, no GPU at all); a plain summary mostly just hits Gemma and the router.

The honest part, since this sub kept me honest last time:

Still want the exact thing I wanted in the first post: a model that writes like me and handles the boring form-filling and pattern stuff. I'm closer: the system now captures my edits as correction data, which is the start of a real fine-tune set. Haven't pulled the QLoRA trigger yet.

So the same questions stand, and I'd genuinely take advice:

Tell me what I'm doing wrong.
Update on 12x32gb sxm v100 cluster / local AI for legal drafting
Reddit r/LocalLLaMA / 5/26/2026
💬 Opinion, Developer Stack & Infrastructure, Signals & Early Trends, Tools & Practical Usage, Models & Research
Key Points
- A lawyer running local AI for legal drafting reports the final V100 server build: twelve V100-SXM2 32GB GPUs on a Threadripper Pro, carefully grouped per NVLink board to avoid large performance drops from cross-board PCIe/NUMA hops.
- After hardware and software iterations, the setup switched from vLLM to llama.cpp because the desired MoE GGUF models perform poorly on Volta with vLLM, and llama.cpp provided better compatibility (including a fix for a Gemma chat-parser bug).
- Performance testing shows MoE models achieve practical decode speeds on V100 (e.g., ~113 tok/s for Gemma-4-26B-A4B MoE and ~50 tok/s for Qwen3.5-122B-A10B MoE on four V100s), while dense 27–32B models are much slower (~20–28 tok/s) and large dense (~128B) models are near unusable.
- The results led to a strategic pivot: instead of chasing large dense weights, the system design now focuses on MoE architectures to maintain usable throughput across long-context drafting prompts (over 25k tokens).
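
The one-board rule from the first key point can be sketched as a small placement check. This is a minimal Python sketch, not the poster's actual tooling: the group labels (`board_a`, `board_b`, `nvlink_pair`, `mixed_pair`) and the function names are hypothetical, and it assumes the GPU indices quoted in the post match the order in which CUDA enumerates the devices.

```python
# GPU groups as listed in the post; the dictionary keys are illustrative labels.
BOARDS = {
    "board_a": {4, 5, 8, 9},
    "board_b": {6, 7, 10, 11},
    "nvlink_pair": {0, 1},
    "mixed_pair": {2, 3},
}


def board_for(gpus):
    """Return the label of the single group that contains every GPU, else None."""
    wanted = set(gpus)
    if not wanted:
        return None
    for name, members in BOARDS.items():
        if wanted <= members:
            return name
    return None


def visible_devices(gpus):
    """Build a CUDA_VISIBLE_DEVICES value, refusing placements that cross groups."""
    if board_for(gpus) is None:
        raise ValueError(f"GPUs {sorted(set(gpus))} do not sit inside one board")
    return ",".join(str(g) for g in sorted(set(gpus)))


if __name__ == "__main__":
    # One model on Board A: allowed.
    print(visible_devices([9, 4, 8, 5]))
    # One model split across Board A and Board B: rejected.
    try:
        visible_devices([4, 5, 6, 7])
    except ValueError as err:
        print(err)
```

The returned string would typically be passed as the `CUDA_VISIBLE_DEVICES` environment variable of whichever server process hosts that model, so the process cannot see GPUs on another board at all.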


