Qwen3.5-397B-A17B reaches 20 t/s TG and 700t/s PP with a 5090

Reddit r/LocalLLaMA / 3/25/2026

💬 OpinionSignals & Early TrendsTools & Practical UsageModels & Research

Read original →

共有:

Key Points

A user benchmarked Qwen3.5-397B-A17B (Q4_K_M GGUF) on a single NVIDIA GeForce RTX 5090 with 256GB DDR4 and reported token speeds using llama-bench.
For a short context setting, the model achieved about 717.9 t/s for PP (pp8192) and about 20.0 t/s for TG (tg128) in the provided benchmark output.
With 128k context length, throughput was about 562.2 ± 7.9 t/s (PP) while TG (tg128) dropped to about 17.9 ± 0.3 t/s.
The report focuses on performance feasibility and does not claim official model release; it highlights what speeds may be attainable with one 5090 and sufficient DDR4 RAM.
The benchmark setup details (EPYC 7532, PCIe 4.0 x16 link, 2TB NVMe) suggest the results depend heavily on system configuration and context length.

I could not find good data points on what speed one could get with a single 5090 and enough DDR4 RAM.

My system: AMD EPYC 7532 32core CPU, ASRock ROMED8-2T motherboard, 256GB 3200Mhz DDR4, one 5090 and 2TB NVME SSD.

Note that I bought this system before RAM crisis.

5090 is connected at PCIE4.0 x16 speed.

So, here are some speed metrics for Qwen3.5-397B-A17B Q4_K_M from bartowski/Qwen_Qwen3.5-397B-A17B-GGUF.

./build/bin/llama-bench -m /media/epyc-llm/disk/llm_models/Qwen_Qwen3.5-397B-A17B-GGUF/Qwen_Qwen3.5-397B-A17B-Q4_K_M/Qwen_Qwen3.5-397B-A17B-Q4_K_M-00001-of-00007.gguf -ot ".ffn_(up|down|gate)_exps.=CPU" -ngl 999 -b 8192 -ub 8192 -d 0 -p 8192 -mmp 0 -fa 1 ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes | model | size | params | backend | ngl | n_batch | n_ubatch | fa | ot | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------------- | --------------: | -------------------: | | qwen35moe 397B.A17B Q4_K - Medium | 225.25 GiB | 396.35 B | CUDA | 999 | 8192 | 8192 | 1 | .ffn_(up|down|gate)_exps.=CPU | pp8192 | 717.87 ± 1.82 | | qwen35moe 397B.A17B Q4_K - Medium | 225.25 GiB | 396.35 B | CUDA | 999 | 8192 | 8192 | 1 | .ffn_(up|down|gate)_exps.=CPU | tg128 | 20.00 ± 0.11 | build: c5a778891 (8233)

Here is the speed at 128k context:

./build/bin/llama-bench -fa 1 -m /media/epyc-llm/disk/llm_models/Qwen_Qwen3.5-397B-A17B-GGUF/Qwen_Qwen3.5-397B-A17B-Q4_K_M/Qwen_Qwen3.5-397B-A17B-Q4_K_M-00001-of-00007.gguf -ot ".ffn_(up|down|gate)_exps.=CPU" -ngl 99 -b 8192 -ub 8192 -d 128000 -p 8192 ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes | model | size | params | backend | ngl | n_batch | n_ubatch | fa | ot | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------------- | --------------: | -------------------: | | qwen35moe 397B.A17B Q4_K - Medium | 225.25 GiB | 396.35 B | CUDA | 99 | 8192 | 8192 | 1 | .ffn_(up|down|gate)_exps.=CPU | pp8192 @ d128000 | 562.19 ± 7.94 | | qwen35moe 397B.A17B Q4_K - Medium | 225.25 GiB | 396.35 B | CUDA | 99 | 8192 | 8192 | 1 | .ffn_(up|down|gate)_exps.=CPU | tg128 @ d128000 | 17.87 ± 0.33 |

And speed at 200k context:

./build/bin/llama-bench -m /media/epyc-llm/disk/llm_models/Qwen_Qwen3.5-397B-A17B-GGUF/Qwen_Qwen3.5-397B-A17B-Q4_K_M/Qwen_Qwen3.5-397B-A17B-Q4_K_M-00001-of-00007.gguf -ot ".ffn_(up|down|gate)_exps.=CPU" -ngl 999 -b 8192 -ub 8192 -d 200000 -p 8192 -mmp 0 -fa 1 ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes | model | size | params | backend | ngl | n_batch | n_ubatch | fa | ot | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------------- | --------------: | -------------------: | | qwen35moe 397B.A17B Q4_K - Medium | 225.25 GiB | 396.35 B | CUDA | 999 | 8192 | 8192 | 1 | .ffn_(up|down|gate)_exps.=CPU | pp8192 @ d200000 | 496.79 ± 3.25 | | qwen35moe 397B.A17B Q4_K - Medium | 225.25 GiB | 396.35 B | CUDA | 999 | 8192 | 8192 | 1 | .ffn_(up|down|gate)_exps.=CPU | tg128 @ d200000 | 16.97 ± 0.16 | build: c5a778891 (8233)

I also tried ik_llama with the same quant, but I was not able to get better results. TG was slightly faster but PP was lower.

./build/bin/llama-bench -m /media/epyc-llm/disk/llm_models/Qwen_Qwen3.5-397B-A17B-GGUF/Qwen_Qwen3.5-397B-A17B-Q4_K_M/Qwen_Qwen3.5-397B-A17B-Q4_K_M-00001-of-00007.gguf -b 8192 -ub 8192 -p 8192 -muge 1 -fa 1 -ot exps=CPU -mmp 0 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32106 MiB | model | size | params | backend | ngl | n_batch | n_ubatch | mmap | muge | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | ---: | ---: | ------------: | ---------------: | ~ggml_backend_cuda_context: have 0 graphs | qwen35moe 397B.A17B Q4_K - Medium | 360.25 GiB | 654.04 B | CUDA | 999 | 8192 | 8192 | 0 | 1 | pp8192 | 487.20 ± 7.61 | ~ggml_backend_cuda_context: have 181 graphs | qwen35moe 397B.A17B Q4_K - Medium | 360.25 GiB | 654.04 B | CUDA | 999 | 8192 | 8192 | 0 | 1 | tg128 | 20.86 ± 0.24 | ~ggml_backend_cuda_context: have 121 graphs build: 233225db (4347)

Power usage was around 400W for the entire system during TG.

It would be interesting to see Apple M5 Max or Ultra comparison here (when we get the ULTRA version) and other server setups with low GPU VRAM and high RAM.

submitted by /u/MLDataScientist
[link] [comments]

Sentiment Analysis API Tutorial: Build a Customer Review Dashboard

Dev.to

Teaching AI Agents to Handle NFTs: ERC-721, ERC-1155, and Metaplex

Dev.to

The Complete Guide to Model Context Protocol (MCP): Building AI-Native Applications in 2026

Dev.to

AI Agent Skill Security Report — 2026-03-25

Dev.to

How to Build Multi-Agent AI Systems That Actually Work: A 2026 Practical Guide

Dev.to

Qwen3.5-397B-A17B reaches 20 t/s TG and 700t/s PP with a 5090

Key Points

Related Articles

Sentiment Analysis API Tutorial: Build a Customer Review Dashboard

Teaching AI Agents to Handle NFTs: ERC-721, ERC-1155, and Metaplex

The Complete Guide to Model Context Protocol (MCP): Building AI-Native Applications in 2026

AI Agent Skill Security Report — 2026-03-25

How to Build Multi-Agent AI Systems That Actually Work: A 2026 Practical Guide

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer