Power-limit vs TG/s for 2x3090

Reddit r/LocalLLaMA / 4/28/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A Reddit post explores the trade-off between GPU power limits and generation throughput (tg/s, tokens generated per second) on a 2×3090 setup.
  • The author reports that 250W appears to be a "sweet spot" for Qwen3.6-27B.
  • They also note that throughput was higher at 275W with a single concurrent request, suggesting the power/throughput curve shifts with workload concurrency.
  • The post includes the exact vLLM server configuration and benchmark command used (AutoRound int4 quantization, chunked prefill, prefix caching, and MTP speculative decoding).

Power-limit vs TG/s for 2x3090

Trying to find the sweet spot in the trade-off between power draw and tg/s.

250W seems to be a sweet spot for Qwen3.6-27B.

It's interesting that I got higher tg/s at 275W with 1 concurrent request.
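
For anyone reproducing this, the power cap itself is set outside vLLM. A minimal sketch using nvidia-smi, assuming the two 3090s are GPU indices 0 and 1 (setting the limit requires root):

sudo nvidia-smi -i 0,1 -pl 250                  # cap both cards at 250 W
nvidia-smi -q -d POWER | grep -i 'power limit'  # confirm the enforced limit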

vLLM server config (from tedivm):

vllm serve /models/Qwen3.6-27B-int4-AutoRound \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.85 \
  --served-model-name Qwen3.6-27B-int4-AutoRound \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-prefix-caching \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
  --max-num-seqs 8 \
  --quantization auto_round \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4128 \
  --disable-custom-all-reduce
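
With these flags the server exposes vLLM's OpenAI-compatible API on port 8000, so a quick smoke test before benchmarking might look like this (the prompt and max_tokens are placeholders, not from the post):

curl http://192.168.254.10:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Qwen3.6-27B-int4-AutoRound",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'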

Benchmark command:

vllm bench serve --backend openai \
  --dataset-name sharegpt \
  --max-concurrency 1 \
  --num-prompts 100 \
  --base-url http://192.168.254.10:8000 \
  --tokenizer Lorbus/Qwen3.6-27B-int4-AutoRound \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --seed 777
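
To map out a full power-limit vs tg/s curve rather than single points, the two commands combine into a simple sweep. A sketch assuming GPUs 0 and 1 and an already-running server; the wattage list and log file names are illustrative, not from the post:

#!/usr/bin/env bash
# Hypothetical sweep: rerun the same benchmark at several power caps.
for PL in 200 225 250 275 300; do
  sudo nvidia-smi -i 0,1 -pl "$PL"   # cap both 3090s at the current wattage
  vllm bench serve --backend openai \
    --dataset-name sharegpt \
    --max-concurrency 1 \
    --num-prompts 100 \
    --base-url http://192.168.254.10:8000 \
    --tokenizer Lorbus/Qwen3.6-27B-int4-AutoRound \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --seed 777 | tee "bench_${PL}W.log"   # keep one log per power limit
done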

submitted by /u/JC1DA