Just by way of background: I am from the Midwest but I’m a lawyer in South Carolina (and I am actually preparing for a trial next week and should be asleep). I have had my own law firm for 11 years now.
About 4 months ago Claude Code did some things that were pretty powerful and scared the shit out of me. Since then I’ve probably wasted more time than I gained, but I have been successful in automating a lot of low-level paralegal-type tasks, and have learned a lot. It has been fun along the way, or at least interesting in a way that I have enjoyed.
I got fixated on having a local private server running a local model that I could do RAG and QLoRA/DoRA on. Still moving towards that goal when I’m not too busy with other things.
I was not building computers or successfully installing and running headless Linux servers, or setting up local networks four months ago, so I feel like there has been a good bit of progress on several fronts even if a fair bit of $$ has been misallocated and lots of time has been wasted along the way.
Anyhow, my first local AI machine is done, or almost done. It is 10x SXM2 V100s on two 4-card NVLink boards plus a 2-card NVLink board, on a Threadripper Pro with 256 GB of DDR4. I have my last 2 V100s coming, and another 2-card board for them. And then no more V100s. 12x 32 GB V100s will be this server’s final form: 384 GB of VRAM.
Maybe I’ll get another 4-card board for better parallelism… maybe. Or I’ll get a fourth RTX 3090 and some 64 GB RAM sticks for my other motherboard…
Man this is just the corniest mid life crisis I could have ever had.
Anyway, I am still totally tied to Claude Code, so I use it to orchestrate, install, and configure everything on my server. I am at the point where I’m starting to test different local models with different inference engines. There have been errors and miscommunications along the way. Linux kernels recompiled. New CUDA not working, so having to install vintage CUDA.
I don’t know. Here are some initial testing results. I am not sure if they were slowed down because I was downloading 600 GB of GGUF models while they ran, but I assume not. Tell me if this is ok, what I should do better, why I am stupid, etc. I’ll respond and tell you how rich I am or something as a defense mechanism.
Seriously tell me what I should be doing, other inference engines and settings, tips, whatever.
I guess what I really want to know is which model I can get to emulate my writing style, recognize patterns, and do low-level legal reasoning and form filling. Which models can I QLoRA? Tell me what to do, please.
Today’s vLLM testing results are below (AI slop follows):
# vLLM on 10x V100 SXM2 32GB — Build Notes & Benchmarks
I’m a lawyer, not an engineer. I built this server for running local LLMs for legal work and have been learning as I go. The entire vLLM setup — source build, dependency fixes, benchmarking — was done through Claude Code (Opus). Posting this because I couldn’t find a clear guide for vLLM on V100 hardware and figured others might be in the same spot.
## Hardware
- **CPU:** AMD Threadripper PRO
- **GPUs:** 10x Tesla V100 SXM2 32GB (320 GB VRAM total)
- **Topology:** Two NVLink quad meshes (GPUs 0–3, 4/5/8/9) + NV6 pair (GPUs 6–7)
- **Driver:** NVIDIA 580.126.20
- **OS:** Ubuntu 24.04, headless
## What Works on V100 vLLM
- **FP16 unquantized:** Primary path. `--dtype half`
- **bitsandbytes 4-bit:** Works for models too large for FP16
- **TRITON_ATTN:** Automatic fallback since FlashAttention2 requires SM 80+
- **Tensor/Pipeline parallel:** TP=4 and TP=4 PP=2 both tested successfully
## What Does Not Work
- **GPTQ:** ExLlamaV2 kernels broken on SM 7.0 (vLLM issue #2165)
- **AWQ:** Requires SM 75+
- **FP8:** Requires SM 75+. MiniMax M2.5 uses FP8 internally — dead on arrival.
- **FlashAttention2:** Requires SM 80+
- **DeepSeek MLA:** Hopper/Blackwell only. Full DeepSeek V3/R1 cannot run on vLLM + V100.
## Build Requirements
- **PyTorch 2.11.0+cu126** — cu126 is the last version with V100 support. cu128+ drops Volta.
- **Source compile** with `TORCH_CUDA_ARCH_LIST="7.0"`, `MAX_JOBS=20`
- **MoE kernel patch** — issue #36008, change `B.size(1)` to `B.size(0)` in `fused_moe.py` (2 lines)
- **PYTHONNOUSERSITE=1** — required to isolate conda env from stale system packages
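Taken together, the build requirements above boil down to a few environment variables set before the compile. A minimal sketch, assuming a conda env with the cu126 PyTorch wheel already chosen (the repo URL and install steps are the standard vLLM source-build flow, not something verified on this exact machine):

```shell
# Sketch of the V100 source-build environment described above.
export PYTHONNOUSERSITE=1           # isolate the conda env from stale system packages
export TORCH_CUDA_ARCH_LIST="7.0"   # compile kernels for Volta (SM 7.0) only
export MAX_JOBS=20                  # parallel compile jobs; size to cores/RAM

# PyTorch from the cu126 wheel index -- the last line with Volta support
pip install torch --index-url https://download.pytorch.org/whl/cu126

# vLLM from source (apply the fused_moe.py patch from issue #36008 before building)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```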
## Critical Fix: NCCL Dependency Conflict
`pip install -e .` pulls in `nvidia-nccl-cu13` alongside `nvidia-nccl-cu12`. The cu13 library gets loaded at runtime and references CUDA 13 symbols that don’t exist in the cu126 runtime. Result: “NCCL error: unhandled cuda error” on every multi-GPU launch.
**Fix:** uninstall all `nvidia-*` pip packages, reinstall PyTorch cu126 from the PyTorch wheel index (pulls correct cu12 deps), then reinstall vLLM editable with `--no-deps`.
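A hedged sketch of that fix sequence (the grep-based uninstall is one way to catch every `nvidia-*` wheel; adapt to your env):

```shell
# 1. Remove all NVIDIA runtime wheels, including the conflicting cu13 variant
pip uninstall -y $(pip list --format=freeze | grep '^nvidia-' | cut -d= -f1)

# 2. Reinstall PyTorch from the cu126 wheel index so the correct cu12 deps return
pip install --force-reinstall torch --index-url https://download.pytorch.org/whl/cu126

# 3. Reinstall vLLM editable without letting pip touch dependencies again
pip install -e . --no-deps
```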
## Required Launch Flags
```
--dtype half
--enforce-eager
--no-enable-chunked-prefill
--gpu-memory-utilization 0.90
CUDA_DEVICE_ORDER=PCI_BUS_ID
```
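Put together, a 4-GPU tensor-parallel launch with these flags might look like this (model ID, port, and context length are illustrative placeholders, not the exact benchmark invocation):

```shell
# Example vLLM server launch for a TP=4 run on V100s.
# Flags match the required list above; model and port are placeholders.
CUDA_DEVICE_ORDER=PCI_BUS_ID \
vllm serve CohereForAI/c4ai-command-r-v01 \
  --dtype half \
  --enforce-eager \
  --no-enable-chunked-prefill \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --tensor-parallel-size 4 \
  --port 8000
```

For the TP=4 PP=2 runs, add `--pipeline-parallel-size 2` to spread layers across a second group of four GPUs.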
## Benchmark Results
FP16, enforce-eager, max-model-len 8192. Five prompts per model (256 max tokens). First request includes warmup overhead.
| Model         | Params   | GPUs | Config    | Avg tok/s | Steady tok/s |
|---------------|----------|------|-----------|-----------|--------------|
| Command R 32B | 35B      | 4    | TP=4      | 33.1      | 35.2         |
| Gemma 4 31B   | 31B      | 4    | TP=4      | 21.6      | 21.6         |
| Qwen 2.5 72B  | 72B      | 8    | TP=4 PP=2 | 13.9      | 14.9         |
| MiniMax M2.5  | 456B MoE | 8    | TP=4 PP=2 | N/A (FP8) | N/A          |
*Gemma 4’s lower throughput vs Command R at similar size is likely due to heterogeneous head dimensions (256/512) forcing additional overhead in the TRITON_ATTN path.*
## Models That Don’t Fit on vLLM V100
- **MiniMax M2.5:** FP8 weights. Needs SM 75+. Runs fine as GGUF on llama.cpp.
- **DeepSeek V3/V3.2/R1 (671B):** MLA attention kernels need Hopper. Use llama.cpp with `-cmoe`.
- **Llama 4 Maverick (400B MoE):** FP16 is ~800 GB. GGUF on Ollama/llama.cpp only.
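For the llama.cpp fallback mentioned above, a minimal launch sketch (model path, `-ngl`, and context size are placeholders; `-cmoe` is the short form of `--cpu-moe`, which keeps MoE expert tensors in system RAM while the rest offloads to GPU):

```shell
# Sketch: serving a large MoE GGUF with llama.cpp, experts kept in system RAM.
# -cmoe (--cpu-moe): expert tensors stay on CPU
# -ngl 99: offload all remaining layers to the GPUs
# -c 8192: context window, sized to the VRAM budget
./llama-server \
  -m /models/deepseek-v3-q4_k_m.gguf \
  -cmoe \
  -ngl 99 \
  -c 8192 \
  --port 8080
```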
## Setup Done Via
Claude Code (Opus 4) running on the server over SSH. I described what I wanted, it handled the source build, dependency debugging, NCCL fix, model downloads, and benchmarking. I’m learning the technical side but still rely on it for anything involving compilation or package management.