Mistral Medium 3.5 128B on 4x3080 20GB with layer split:
```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-bench --model /data/huggingface/Mistral-Medium-3.5-GGUF/Mistral-Medium-3.5-128B-IQ4_XS-00001-of-00003.gguf -ngl 99 -d 0,16384 -fa 1 --split-mode layer

ggml_cuda_init: found 4 CUDA devices (Total VRAM: 80211 MiB):
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB
  Device 2: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB
  Device 3: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB

| model                         |      size |   params | backend   | threads | fa |           test |           t/s |
| ----------------------------- | --------: | -------: | --------- | ------: | -: | -------------: | ------------: |
| mistral3 ?B IQ4_XS - 4.25 bpw | 62.51 GiB | 125.03 B | CUDA,BLAS |      64 |  1 |          pp512 | 330.87 ± 0.99 |
| mistral3 ?B IQ4_XS - 4.25 bpw | 62.51 GiB | 125.03 B | CUDA,BLAS |      64 |  1 |          tg128 |  10.37 ± 0.00 |
| mistral3 ?B IQ4_XS - 4.25 bpw | 62.51 GiB | 125.03 B | CUDA,BLAS |      64 |  1 | pp512 @ d16384 | 216.76 ± 0.26 |
| mistral3 ?B IQ4_XS - 4.25 bpw | 62.51 GiB | 125.03 B | CUDA,BLAS |      64 |  1 | tg128 @ d16384 |   9.30 ± 0.00 |

build: d05fe1d (275)
```

With tensor parallel from a recent llama.cpp PR:
```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-bench --model /data/huggingface/Mistral-Medium-3.5-GGUF/Mistral-Medium-3.5-128B-IQ4_XS-00001-of-00003.gguf -ngl 99 -d 0,16384 -fa 1 --split-mode tensor

ggml_cuda_init: found 4 CUDA devices (Total VRAM: 80211 MiB):
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB
  Device 2: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB
  Device 3: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB

| model                         |      size |   params | backend   | threads |     sm | fa |           test |           t/s |
| ----------------------------- | --------: | -------: | --------- | ------: | -----: | -: | -------------: | ------------: |
| mistral3 ?B IQ4_XS - 4.25 bpw | 62.51 GiB | 125.03 B | CUDA,BLAS |      64 | tensor |  1 |          pp512 | 233.88 ± 1.01 |
| mistral3 ?B IQ4_XS - 4.25 bpw | 62.51 GiB | 125.03 B | CUDA,BLAS |      64 | tensor |  1 |          tg128 |  21.59 ± 0.05 |
| mistral3 ?B IQ4_XS - 4.25 bpw | 62.51 GiB | 125.03 B | CUDA,BLAS |      64 | tensor |  1 | pp512 @ d16384 | 214.34 ± 4.16 |
| mistral3 ?B IQ4_XS - 4.25 bpw | 62.51 GiB | 125.03 B | CUDA,BLAS |      64 | tensor |  1 | tg128 @ d16384 |  20.31 ± 0.17 |

build: d05fe1d (275)
```

TP4 roughly doubles token-generation speed compared to the old layer split, at the cost of some prompt-processing speed. I think the speed is acceptable for chat, but the model itself is not great and may not justify its size when compared against Gemma-4-31B or Qwen3.6-27B.
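As a quick sanity check on the ~2x claim, here is a small Python sketch that just recomputes the ratios from the numbers reported in the two tables above (nothing here is measured independently):

```python
# Ratios of tensor-split over layer-split throughput, copied from the llama-bench tables above.
layer = {"pp512": 330.87, "tg128": 10.37, "pp512 @ d16384": 216.76, "tg128 @ d16384": 9.30}
tensor = {"pp512": 233.88, "tg128": 21.59, "pp512 @ d16384": 214.34, "tg128 @ d16384": 20.31}

for test, layer_tps in layer.items():
    print(f"{test:>15}: {tensor[test] / layer_tps:.2f}x")
# tg128 -> ~2.08x, tg128 @ d16384 -> ~2.18x, while pp512 regresses to ~0.71x (and ~0.99x at depth 16384)
```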
Here is a comparison with a similarly sized MoE model, Qwen3.5-122B-A10B. Tensor parallel from llama.cpp may not improve generation speed for the Qwen3.5 MoE in this setup.
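A rough way to see why: token generation is largely bound by how many weight bytes each GPU must read per token, and the MoE only activates ~10B of its ~122B parameters per token. The sketch below is a back-of-envelope estimate only; the 10B active-parameter count is taken from the model name, the 4.25 bits/weight from the IQ4_XS label, and KV-cache and attention traffic are ignored:

```python
# Back-of-envelope: weight bytes read per generated token (rough estimate, assumptions above).
BPW = 4.25          # IQ4_XS average bits per weight
GIB = 1024**3

def gib_per_token(active_params_b):
    # active parameters (in billions) * bits/weight -> GiB of weights touched per token
    return active_params_b * 1e9 * BPW / 8 / GIB

dense_read = gib_per_token(125.03)  # Mistral Medium: every weight is read each token
moe_read = gib_per_token(10.0)      # Qwen3.5 MoE: only ~10B active parameters per token

print(f"dense ~{dense_read:.1f} GiB/token, MoE ~{moe_read:.1f} GiB/token")
# dense ~61.9 GiB/token vs MoE ~4.9 GiB/token: the dense model is far more bandwidth-starved,
# so splitting each token's weight reads across 4 GPUs helps it much more than it helps the MoE.
```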
```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-bench --model /data/huggingface/Qwen3.5-122B-GGUF/Qwen3.5-122B-A10B-UD-IQ4_XS-00001-of-00003.gguf -ngl 99 -d 0,16384 -fa 1 --split-mode layer

ggml_cuda_init: found 4 CUDA devices (Total VRAM: 80211 MiB):
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB
  Device 2: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB
  Device 3: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB

| model                                 |      size |   params | backend   | threads | fa |           test |            t/s |
| ------------------------------------- | --------: | -------: | --------- | ------: | -: | -------------: | -------------: |
| qwen35moe 122B.A10B IQ4_XS - 4.25 bpw | 56.08 GiB | 122.11 B | CUDA,BLAS |      64 |  1 |          pp512 | 1087.44 ± 6.95 |
| qwen35moe 122B.A10B IQ4_XS - 4.25 bpw | 56.08 GiB | 122.11 B | CUDA,BLAS |      64 |  1 |          tg128 |   60.08 ± 0.80 |
| qwen35moe 122B.A10B IQ4_XS - 4.25 bpw | 56.08 GiB | 122.11 B | CUDA,BLAS |      64 |  1 | pp512 @ d16384 |  945.88 ± 6.70 |
| qwen35moe 122B.A10B IQ4_XS - 4.25 bpw | 56.08 GiB | 122.11 B | CUDA,BLAS |      64 |  1 | tg128 @ d16384 |   57.78 ± 0.72 |

build: d05fe1d (275)
```

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-bench --model /data/huggingface/Qwen3.5-122B-GGUF/Qwen3.5-122B-A10B-UD-IQ4_XS-00001-of-00003.gguf -ngl 99 -d 0,16384 -fa 1 --split-mode tensor

ggml_cuda_init: found 4 CUDA devices (Total VRAM: 80211 MiB):
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB
  Device 2: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB
  Device 3: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 20052 MiB

| model                                 |      size |   params | backend   | threads |     sm | fa |           test |             t/s |
| ------------------------------------- | --------: | -------: | --------- | ------: | -----: | -: | -------------: | --------------: |
| qwen35moe 122B.A10B IQ4_XS - 4.25 bpw | 56.08 GiB | 122.11 B | CUDA,BLAS |      64 | tensor |  1 |          pp512 | 1216.15 ± 16.63 |
| qwen35moe 122B.A10B IQ4_XS - 4.25 bpw | 56.08 GiB | 122.11 B | CUDA,BLAS |      64 | tensor |  1 |          tg128 |    53.49 ± 0.29 |
| qwen35moe 122B.A10B IQ4_XS - 4.25 bpw | 56.08 GiB | 122.11 B | CUDA,BLAS |      64 | tensor |  1 | pp512 @ d16384 | 1110.03 ± 42.33 |
| qwen35moe 122B.A10B IQ4_XS - 4.25 bpw | 56.08 GiB | 122.11 B | CUDA,BLAS |      64 | tensor |  1 | tg128 @ d16384 |    56.67 ± 1.39 |

build: d05fe1d (275)
```

vLLM wins here, even with a memory-efficient config: MTP off, a tuned CUDA graph config, and --language-model-only. But that config only leaves room for ~64k of KV cache; a 4x3090 setup would suit this model much better.
```
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Qwen3.5-122B-A10B-GPTQ-Int4 -tp 4 --max-model-len 65536 --gpu-memory-utilization 0.97 --max-num-seqs 8 --tool-call-parser qwen3_xml --reasoning-parser qwen3 --enable-auto-tool-choice --enable-prefix-caching --enable-expert-parallel --compilation_config '{"mode": 3,"cudagraph_mode": "FULL_DECODE_ONLY","cudagraph_capture_sizes": [1,2,4,8]}' --language-model-only
```

```
vllm bench serve --dataset-name random --num-prompts 8 --backend vllm --host 127.0.0.1 --port 8000 --max-concurrency 8 --tokenizer Qwen3.5-4B --model Qwen3.5-122B-A10B-GPTQ-Int4 --random-input-len 2048 --output-len 256

============ Serving Benchmark Result ============
Successful requests:                     8
Failed requests:                         0
Maximum request concurrency:             8
Benchmark duration (s):                  10.95
Total input tokens:                      16384
Total generated tokens:                  2048
Request throughput (req/s):              0.73
Output token throughput (tok/s):         187.04
Peak output token throughput (tok/s):    416.00
Peak concurrent requests:                8.00
Total token throughput (tok/s):          1683.40
---------------Time to First Token----------------
Mean TTFT (ms):                          3541.17
Median TTFT (ms):                        3572.61
P99 TTFT (ms):                           5782.89
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          28.39
Median TPOT (ms):                        28.59
P99 TPOT (ms):                           37.35
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.39
Median ITL (ms):                         19.86
P99 ITL (ms):                            327.19
==================================================
```

```
vllm bench serve --dataset-name random --num-prompts 16 --backend vllm --host 127.0.0.1 --port 8000 --max-concurrency 1 --tokenizer Qwen3.5-4B --model Qwen3.5-122B-A10B-GPTQ-Int4 --random-input-len 2048 --output-len 256

============ Serving Benchmark Result ============
Successful requests:                     16
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  61.08
Total input tokens:                      32768
Total generated tokens:                  4096
Request throughput (req/s):              0.26
Output token throughput (tok/s):         67.06
Peak output token throughput (tok/s):    131.00
Peak concurrent requests:                2.00
Total token throughput (tok/s):          603.58
---------------Time to First Token----------------
Mean TTFT (ms):                          732.35
Median TTFT (ms):                        651.94
P99 TTFT (ms):                           1763.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.10
Median TPOT (ms):                        11.61
P99 TPOT (ms):                           13.45
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.10
Median ITL (ms):                         11.55
P99 ITL (ms):                            27.51
==================================================
```
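For a rough read of these serving numbers against the llama-bench figures (different benchmark methodologies, so treat this as indicative only), the sketch below converts the concurrency-1 mean TPOT into a steady-state decode rate and compares it with the layer-split tg128 from the Qwen table above:

```python
# Rough comparison of the vLLM concurrency-1 run with llama.cpp layer-split tg128,
# using only the numbers reported above; not an apples-to-apples benchmark.
mean_tpot_ms = 12.10                 # vLLM mean TPOT at --max-concurrency 1
decode_tok_s = 1000 / mean_tpot_ms   # ~82.6 tok/s steady-state decode, excluding TTFT
llama_cpp_tg = 60.08                 # llama-bench tg128, --split-mode layer, from the table above

print(f"vLLM ~{decode_tok_s:.0f} tok/s vs llama.cpp ~{llama_cpp_tg:.0f} tok/s "
      f"(~{decode_tok_s / llama_cpp_tg:.2f}x)")   # ~1.38x, before counting batched throughput
```

At concurrency 8 the gap widens further, since vLLM reaches ~187 tok/s aggregate output throughput in the first run above.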

