V100 32 GB: 6h of benchmarks across 20 models with CPU offloading & power limitations
Reddit r/LocalLLaMA / 3/28/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

I posted a few days ago about my setup here: https://www.reddit.com/r/LocalLLaMA/comments/1s0fje7/nvidia_v100_32_gb_getting_115_ts_on_qwen_coder/

- Ryzen 7600X & 32 GB DDR5
- NVIDIA V100 32 GB PCIe (air cooled)

I ran a 6-hour benchmark across 20 models (MoE & dense), from Nemotron and Qwen to DeepSeek 70B, with different configurations of:

- Power limit (300W, 250W, 200W, 150W)
- CPU offload (100% GPU, 75% GPU, 50% GPU, 25% GPU, 0% GPU)
- Context window (up to 32K)

TL;DR:

- Power limiting is free for generation. Running at 200W saves 100W with <2% loss on tg128. MoE/hybrid models are bandwidth-bound. Only dense prompt processing shows degradation at 150W (−22%). Recommended daily: 200W.
- MoE models handle offload far better than dense. Most MoE models retain 100% tg128 at ngl 50 — offloaded layers hold dormant experts. Dense models lose 71–83% immediately. gpt-oss is the offload champion — full speed down to ngl 30.
- Architecture matters more than parameter count. Nemotron-30B Mamba2 at 152 t/s beats the dense Qwen3.5-40B at 21 t/s — a 7× speed advantage with fewer parameters and less VRAM.
- V100 minimum power is 150W. 100W was rejected. The SXM2 range is 150–300W. At 150W, MoE models still deliver 90–97% performance.
- Dense 70B offload is not viable. Peak 3.8 t/s. PCIe Gen 3 bandwidth is the bottleneck. An 80B MoE in VRAM (78 t/s) is 20× faster.
- Best daily drivers on V100-32GB:
  - Speed: Nemotron-30B Q3_K_M — 152 t/s, Mamba2 hybrid
  - Code: Qwen3-Coder-30B Q4_K_M — 127 t/s, MoE
  - All-round: Qwen3.5-35B-A3B Q4_K_M — 102 t/s, MoE
  - Smarts: Qwen3-Next-80B IQ1_M — 78 t/s, 80B GatedDeltaNet
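The post does not include the benchmark harness itself. Below is a minimal sketch of how such a sweep could be scripted around llama.cpp's llama-bench and nvidia-smi; the model paths and ngl values are placeholders (the post sweeps offload as a percentage of layers, which maps to a per-model -ngl count), and changing the power limit requires root.

```python
import itertools
import subprocess

# Hypothetical GGUF paths; substitute your own models.
MODELS = [
    "models/Qwen3-Coder-30B-Q4_K_M.gguf",
    "models/Nemotron-30B-Q3_K_M.gguf",
]
POWER_LIMITS_W = [300, 250, 200, 150]  # V100 SXM2 accepts 150-300 W
NGL_VALUES = [99, 50, 30, 0]           # layers kept on GPU; 99 = everything

def set_power_limit(watts: int) -> None:
    """Cap the board power with nvidia-smi (requires root)."""
    subprocess.run(["nvidia-smi", "-i", "0", "-pl", str(watts)], check=True)

def bench(model: str, ngl: int) -> str:
    """Run llama-bench's default pp512/tg128 workloads and return CSV output."""
    proc = subprocess.run(
        ["llama-bench", "-m", model, "-ngl", str(ngl),
         "-p", "512", "-n", "128", "-o", "csv"],
        check=True, capture_output=True, text=True,
    )
    return proc.stdout

for watts in POWER_LIMITS_W:
    set_power_limit(watts)
    for model, ngl in itertools.product(MODELS, NGL_VALUES):
        print(f"# {watts} W, ngl={ngl}, {model}")
        print(bench(model, ngl))
```

Each llama-bench invocation reports pp512 (prompt processing) and tg128 (generation) in t/s, which is where the tg128 figures quoted above come from.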
Key Points
- A 6-hour round of local LLM benchmarks on an air-cooled NVIDIA V100 32GB compared 20 different models (dense and MoE) across power limits (300W to 150W), CPU/GPU offloading levels, and context windows up to 32K.
- The results indicate that power limiting is largely "free" for generation: capping at 200W saves roughly 100W with under 2% loss on tg128, and at 150W only dense prompt processing degrades noticeably (about −22%).
- MoE/hybrid architectures tolerate CPU offloading much better than dense models: many MoE variants retain full tg128 throughput with only 50 layers on the GPU (ngl 50), since the offloaded layers mostly hold dormant experts, whereas dense models immediately lose 71–83%.
- Architectural choice can outweigh raw parameter count: the Nemotron-30B Mamba2 hybrid delivered about 7× the throughput of the dense Qwen3.5-40B (152 vs 21 t/s) with fewer parameters and less VRAM.
- Hardware constraints are a major factor: dense 70B offloading is largely impractical on this platform (peak 3.8 t/s) due to the PCIe Gen3 bandwidth bottleneck, while an 80B MoE that fits in VRAM runs about 20× faster (78 t/s).
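The last two points are essentially bandwidth arithmetic: during generation every active weight must be read once per token, so throughput is bounded by effective bandwidth divided by bytes read per token. A back-of-envelope sketch, using nominal bandwidth and model-size figures that are assumptions rather than measurements from the post:

```python
GB = 1e9

def tps_upper_bound(weight_bytes: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling: tokens/s ~= bandwidth / bytes read per token."""
    return bandwidth_gb_s * GB / weight_bytes

# Nominal bandwidths (assumed, not measured):
V100_HBM2 = 900       # GB/s
DDR5_DUAL_CH = 60     # GB/s, dual-channel DDR5
PCIE_GEN3_X16 = 13    # GB/s, practical throughput

DENSE_70B_Q4 = 40 * GB   # ~40 GB of weights, all touched every token
MOE_ACTIVE = 2 * GB      # MoE touches only its active experts per token

print(f"dense 70B in VRAM:          {tps_upper_bound(DENSE_70B_Q4, V100_HBM2):6.1f} t/s")
print(f"dense 70B in DDR5:          {tps_upper_bound(DENSE_70B_Q4, DDR5_DUAL_CH):6.1f} t/s")
print(f"dense 70B over PCIe 3.0:    {tps_upper_bound(DENSE_70B_Q4, PCIE_GEN3_X16):6.1f} t/s")
print(f"MoE active experts in VRAM: {tps_upper_bound(MOE_ACTIVE, V100_HBM2):6.1f} t/s")
```

With partial offload only the CPU-resident slice is read from system RAM, so offloading roughly 20 GB of a dense 70B gives on the order of 60/20 = 3 t/s, consistent with the 3.8 t/s peak the post reports; a MoE whose active experts stay in VRAM sits orders of magnitude above that ceiling.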