Power Limit vs. TG/s on 2×3090

Reddit r/LocalLLaMA / 2026/4/28


Key Points

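The Reddit post looks for the best trade-off between GPU power limit and token-generation throughput (tg/s) on a 2×3090 setup.
Based on the author's measurements, 250W appears to be the "sweet spot" for Qwen3.6-27B.
With a single concurrent request, however, throughput was higher at 275W, which suggests that the relationship between power and throughput can depend on the workload's concurrency.
The post includes the exact vLLM server configuration and benchmark command used for the measurements (quantization, chunked prefill, prefix caching, speculative decoding settings, and so on).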

I'm looking for the sweet spot in the trade-off between power and tg/s.

250W seems to be the sweet spot for Qwen3.6-27B.

Interestingly, with 1 concurrent request, 275W gave higher tg/s.

vLLM server config (from tedivm):

vllm serve /models/Qwen3.6-27B-int4-AutoRound \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.85 \
  --served-model-name Qwen3.6-27B-int4-AutoRound \
  --host 0.0.0.0 --port 8000 \
  --enable-prefix-caching \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
  --max-num-seqs 8 \
  --quantization auto_round \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4128 \
  --disable-custom-all-reduce

Benchmark command:

vllm bench serve --backend openai \
  --dataset-name sharegpt \
  --max-concurrency 1 \
  --num-prompts 100 \
  --base-url http://192.168.254.10:8000 \
  --tokenizer Lorbus/Qwen3.6-27B-int4-AutoRound \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --seed 777
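
The post does not show how the power limit was changed between runs. One plausible way to repeat the comparison is to set the limit with nvidia-smi and rerun the same benchmark at each wattage. The following is only a sketch under stated assumptions: the GPU indices (0 and 1), the helper names (run, set_power_limit, run_benchmark, sweep), the DRY_RUN switch, and the wattage list are all illustrative and not taken from the post. It also assumes the script runs on the GPU host with sudo rights while the vLLM server is already up.

```shell
#!/bin/sh
# Sketch: sweep GPU power limits and rerun the post's benchmark at each one.
# Assumptions (not from the post): GPU indices 0 and 1, the DRY_RUN switch,
# and that this runs on the GPU host with sudo rights.

DRY_RUN="${DRY_RUN:-1}"          # 1 = print commands only; 0 = execute them
CONCURRENCY="${CONCURRENCY:-1}"  # value passed to --max-concurrency
BASE_URL="${BASE_URL:-http://192.168.254.10:8000}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Apply the same power limit (in watts) to both GPUs.
set_power_limit() {
  for gpu in 0 1; do
    run sudo nvidia-smi -i "$gpu" -pl "$1"
  done
}

# Run the benchmark command from the post against the running server.
run_benchmark() {
  run vllm bench serve --backend openai --dataset-name sharegpt \
    --max-concurrency "$CONCURRENCY" --num-prompts 100 \
    --base-url "$BASE_URL" \
    --tokenizer Lorbus/Qwen3.6-27B-int4-AutoRound \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --seed 777
}

# Usage: sweep 225 250 275
sweep() {
  for watts in "$@"; do
    echo "== power limit: ${watts} W, concurrency: ${CONCURRENCY} =="
    set_power_limit "$watts"
    run_benchmark
  done
}
```

By default the script only prints the commands it would run. After sourcing it, something like `DRY_RUN=0 CONCURRENCY=1 sweep 225 250 275` would execute the sweep; repeating it with a higher CONCURRENCY is one way to check whether the sweet spot shifts with concurrency, as the post suggests.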

submitted by /u/JC1DA