Power-limit vs TG/s for 2x3090

Reddit r/LocalLLaMA / 4/28/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A Reddit post explores the trade-off between GPU power limits and generation throughput (tg/s, tokens generated per second) on a 2×3090 setup.
  • The author reports that 250W appears to be a "sweet spot" for Qwen3.6-27B.
  • They also note that throughput was higher at 275W with a single concurrent request, suggesting the power/throughput curve shifts with workload concurrency.
  • The post includes the exact vLLM server configuration and benchmark command used (AutoRound int4 quantization, chunked prefill, prefix caching, and MTP speculative decoding).

Power-limit vs TG/s for 2x3090

Trying to find the sweet spot in the trade-off between power draw and tg/s.

250W seems to be a sweet spot for Qwen3.6-27B.

It's interesting that I got higher tg/s at 275W with 1 concurrent request.
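
For anyone reproducing this, the power cap itself is set outside vLLM. A minimal sketch using nvidia-smi, assuming the two 3090s are GPU indices 0 and 1 (setting the limit requires root):

sudo nvidia-smi -i 0,1 -pl 250                  # cap both cards at 250 W
nvidia-smi -q -d POWER | grep -i 'power limit'  # confirm the enforced limit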

vLLM server config (from tedivm):

vllm serve /models/Qwen3.6-27B-int4-AutoRound \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.85 \
  --served-model-name Qwen3.6-27B-int4-AutoRound \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-prefix-caching \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
  --max-num-seqs 8 \
  --quantization auto_round \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4128 \
  --disable-custom-all-reduce
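
With these flags the server exposes vLLM's OpenAI-compatible API on port 8000, so a quick smoke test before benchmarking might look like this (the prompt and max_tokens are placeholders, not from the post):

curl http://192.168.254.10:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Qwen3.6-27B-int4-AutoRound",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'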

Benchmark command:

vllm bench serve --backend openai \
  --dataset-name sharegpt \
  --max-concurrency 1 \
  --num-prompts 100 \
  --base-url http://192.168.254.10:8000 \
  --tokenizer Lorbus/Qwen3.6-27B-int4-AutoRound \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --seed 777
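
To map out a full power-limit vs tg/s curve rather than single points, the two commands combine into a simple sweep. A sketch assuming GPUs 0 and 1 and an already-running server; the wattage list and log file names are illustrative, not from the post:

#!/usr/bin/env bash
# Hypothetical sweep: rerun the same benchmark at several power caps.
for PL in 200 225 250 275 300; do
  sudo nvidia-smi -i 0,1 -pl "$PL"   # cap both 3090s at the current wattage
  vllm bench serve --backend openai \
    --dataset-name sharegpt \
    --max-concurrency 1 \
    --num-prompts 100 \
    --base-url http://192.168.254.10:8000 \
    --tokenizer Lorbus/Qwen3.6-27B-int4-AutoRound \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --seed 777 | tee "bench_${PL}W.log"   # keep one log per power limit
done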

submitted by /u/JC1DA