V100 32 GB: 6h of benchmarks across 20 models with CPU offloading & power limitations
Reddit r/LocalLLaMA / 3/28/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

I posted a few days ago about my setup here: https://www.reddit.com/r/LocalLLaMA/comments/1s0fje7/nvidia_v100_32_gb_getting_115_ts_on_qwen_coder/

- Ryzen 7600X & 32 GB DDR5
- NVIDIA V100 32 GB PCIe (air cooled)

I ran a 6-hour benchmark across 20 models (MoE & dense), from Nemotron and Qwen to DeepSeek 70B, with different configurations of:

- Power limit (300W, 250W, 200W, 150W)
- CPU offload (100% GPU, 75% GPU, 50% GPU, 25% GPU, 0% GPU)
- Context window (up to 32K)

TL;DR:

- Power limiting is free for generation. Running at 200W saves 100W with <2% loss on tg128. MoE/hybrid models are bandwidth-bound. Only dense prompt processing shows degradation at 150W (−22%). Recommended daily: 200W.
- MoE models handle offload far better than dense. Most MoE models retain 100% tg128 at ngl 50 — offloaded layers hold dormant experts. Dense models lose 71–83% immediately. gpt-oss is the offload champion — full speed down to ngl 30.
- Architecture matters more than parameter count. Nemotron-30B Mamba2 at 152 t/s beats the dense Qwen3.5-40B at 21 t/s — a 7× speed advantage with fewer parameters and less VRAM.
- V100 minimum power is 150W. 100W was rejected. The SXM2 range is 150–300W. At 150W, MoE models still deliver 90–97% performance.
- Dense 70B offload is not viable. Peak 3.8 t/s. PCIe Gen 3 bandwidth is the bottleneck. An 80B MoE in VRAM (78 t/s) is 20× faster.
- Best daily drivers on V100-32GB:
  - Speed: Nemotron-30B Q3_K_M — 152 t/s, Mamba2 hybrid
  - Code: Qwen3-Coder-30B Q4_K_M — 127 t/s, MoE
  - All-round: Qwen3.5-35B-A3B Q4_K_M — 102 t/s, MoE
  - Smarts: Qwen3-Next-80B IQ1_M — 78 t/s, 80B GatedDeltaNet
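The post does not include the benchmark harness itself. Below is a minimal sketch of how such a sweep could be scripted around llama.cpp's llama-bench and nvidia-smi; the model paths and ngl values are placeholders (the post sweeps offload as a percentage of layers, which maps to a per-model -ngl count), and changing the power limit requires root.

```python
import itertools
import subprocess

# Hypothetical GGUF paths; substitute your own models.
MODELS = [
    "models/Qwen3-Coder-30B-Q4_K_M.gguf",
    "models/Nemotron-30B-Q3_K_M.gguf",
]
POWER_LIMITS_W = [300, 250, 200, 150]  # V100 SXM2 accepts 150-300 W
NGL_VALUES = [99, 50, 30, 0]           # layers kept on GPU; 99 = everything

def set_power_limit(watts: int) -> None:
    """Cap the board power with nvidia-smi (requires root)."""
    subprocess.run(["nvidia-smi", "-i", "0", "-pl", str(watts)], check=True)

def bench(model: str, ngl: int) -> str:
    """Run llama-bench's default pp512/tg128 workloads and return CSV output."""
    proc = subprocess.run(
        ["llama-bench", "-m", model, "-ngl", str(ngl),
         "-p", "512", "-n", "128", "-o", "csv"],
        check=True, capture_output=True, text=True,
    )
    return proc.stdout

for watts in POWER_LIMITS_W:
    set_power_limit(watts)
    for model, ngl in itertools.product(MODELS, NGL_VALUES):
        print(f"# {watts} W, ngl={ngl}, {model}")
        print(bench(model, ngl))
```

Each llama-bench invocation reports pp512 (prompt processing) and tg128 (generation) in t/s, which is where the tg128 figures quoted above come from.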
Key Points
- A 6-hour round of local LLM benchmarks on an air-cooled NVIDIA V100 32GB compared 20 different models (dense and MoE) across power limits (300W to 150W), CPU/GPU offloading levels, and context windows up to 32K.
- The results indicate that power limiting is largely "free" for generation: capping at 200W saves roughly 100W with under 2% loss on tg128, and at 150W only dense prompt processing degrades noticeably (about −22%).
- MoE/hybrid architectures tolerate CPU offloading much better than dense models: many MoE variants retain full tg128 throughput with only 50 layers on the GPU (ngl 50), since the offloaded layers mostly hold dormant experts, whereas dense models immediately lose 71–83%.
- Architectural choice can outweigh raw parameter count: the Nemotron-30B Mamba2 hybrid delivered about 7× the throughput of the dense Qwen3.5-40B (152 vs 21 t/s) with fewer parameters and less VRAM.
- Hardware constraints are a major factor: dense 70B offloading is largely impractical on this platform (peak 3.8 t/s) due to the PCIe Gen3 bandwidth bottleneck, while an 80B MoE that fits in VRAM runs about 20× faster (78 t/s).
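The last two points are essentially bandwidth arithmetic: during generation every active weight must be read once per token, so throughput is bounded by effective bandwidth divided by bytes read per token. A back-of-envelope sketch, using nominal bandwidth and model-size figures that are assumptions rather than measurements from the post:

```python
GB = 1e9

def tps_upper_bound(weight_bytes: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling: tokens/s ~= bandwidth / bytes read per token."""
    return bandwidth_gb_s * GB / weight_bytes

# Nominal bandwidths (assumed, not measured):
V100_HBM2 = 900       # GB/s
DDR5_DUAL_CH = 60     # GB/s, dual-channel DDR5
PCIE_GEN3_X16 = 13    # GB/s, practical throughput

DENSE_70B_Q4 = 40 * GB   # ~40 GB of weights, all touched every token
MOE_ACTIVE = 2 * GB      # MoE touches only its active experts per token

print(f"dense 70B in VRAM:          {tps_upper_bound(DENSE_70B_Q4, V100_HBM2):6.1f} t/s")
print(f"dense 70B in DDR5:          {tps_upper_bound(DENSE_70B_Q4, DDR5_DUAL_CH):6.1f} t/s")
print(f"dense 70B over PCIe 3.0:    {tps_upper_bound(DENSE_70B_Q4, PCIE_GEN3_X16):6.1f} t/s")
print(f"MoE active experts in VRAM: {tps_upper_bound(MOE_ACTIVE, V100_HBM2):6.1f} t/s")
```

With partial offload only the CPU-resident slice is read from system RAM, so offloading roughly 20 GB of a dense 70B gives on the order of 60/20 = 3 t/s, consistent with the 3.8 t/s peak the post reports; a MoE whose active experts stay in VRAM sits orders of magnitude above that ceiling.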