GLM 5.1 Locally: 40 tps, 2000+ pp/s

Reddit r/LocalLLaMA / 4/26/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research

Key Points

  • The author reports getting a REAP-pruned NVFP4 build of GLM 5.1 running locally with stable, fast inference on four RTX 6000 Pro GPUs power-limited to 350 W.
  • Throughput is reported by context depth: prefill throughput (PP@4096) falls from ~2229 t/s at zero context to ~864 t/s at 64K, while generation throughput (TG@512) declines more gently, from ~42 to ~36 t/s.
  • Peak burst generation throughput is about 43 t/s, and the overall experience with opencode is described as close to Sonnet + Claude Code.
  • Sessions with 100–200k-token contexts are said to be stable; the author plans to try different concurrency settings and notes that concurrency = 2 averages ~65 t/s of generation throughput.
  • The post asks whether others have seen better performance on the same hardware.

After some sglang patching and countless experiments, managed to get the REAP-pruned NVFP4 version running stable and FAST on 4 x RTX 6000 Pros (power-limited to 350W). Very happy with the performance and quality. Inference software is still under-optimized for these cards; I think we will see their true potential unfold later this year or early next.
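For anyone trying to reproduce this kind of setup, here is a minimal sketch of the power cap plus server launch, assuming an sglang install and a local copy of the checkpoint. The model path is hypothetical, and the exact launch flags vary by sglang version (NVFP4/ModelOpt quantization is typically picked up from the checkpoint's quant config, but check your version's docs).

```python
import subprocess

# Cap each of the four boards at 350 W, as in the post (needs root;
# resets on reboot unless persistence mode is configured).
for gpu in range(4):
    subprocess.run(["nvidia-smi", "-i", str(gpu), "-pl", "350"], check=True)

# Serve the model with tensor parallelism across the 4 cards.
# "/models/glm-5.1-reap-nvfp4" is a hypothetical local path.
subprocess.run(
    [
        "python", "-m", "sglang.launch_server",
        "--model-path", "/models/glm-5.1-reap-nvfp4",
        "--tp-size", "4",
        "--port", "30000",
    ],
    check=True,
)
```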

Throughput by Context Depth

Context prefilled   PP@4096 (t/s)   TG@512 (t/s)   TG@512 peak, burst (t/s)
0                   2229.0          42.03          43.00
4K                  1943.6          41.41          42.00
16K                 1558.9          39.72          40.00
32K                 1234.2          38.19          39.00
64K                 863.5           35.87          37.00
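For context, a rough way to reproduce numbers like these against a local OpenAI-compatible endpoint (which sglang exposes by default): time-to-first-token over a prompt of known length approximates prefill t/s, and the streaming rate after the first token approximates generation t/s. A sketch only; the port and served model name are assumptions, and counting one streamed chunk as one token is an approximation.

```python
import time

from openai import OpenAI  # pip install openai

# Assumed local endpoint and model name.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

def measure(prompt: str, prompt_tokens: int, max_tokens: int = 512):
    """Approximate prefill t/s (via time-to-first-token) and generation t/s."""
    start = time.perf_counter()
    first_token_at = None
    completion_tokens = 0
    stream = client.chat.completions.create(
        model="glm-5.1",  # hypothetical served model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            completion_tokens += 1  # crude: one chunk ~ one token
    end = time.perf_counter()
    prefill_tps = prompt_tokens / (first_token_at - start)
    gen_tps = completion_tokens / (end - first_token_at)
    return prefill_tps, gen_tps

# Example: ~4K-token prompt built by repetition (rough token count).
pp, tg = measure("word " * 4096, prompt_tokens=4096)
print(f"prefill ~{pp:.0f} t/s, generation ~{tg:.0f} t/s")
```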

Overall experience with opencode is pretty close to Sonnet + Claude Code. Sessions with 100-200k-token contexts are stable.

Will play with different concurrency settings this weekend.

Anyone seen better performance on this hardware?

PS: concurrency = 2 worked great. Generation hits 65 tps average.
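A quick sketch of how the concurrency = 2 aggregate number can be checked: fire two streaming requests at once and divide total generated tokens by wall-clock time. Same assumptions as above (endpoint, model name, one chunk ~ one token).

```python
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="none")

async def generate(prompt: str, max_tokens: int = 512) -> int:
    """Stream one completion and return a rough token count."""
    tokens = 0
    stream = await client.chat.completions.create(
        model="glm-5.1",  # hypothetical served model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            tokens += 1
    return tokens

async def main(concurrency: int = 2):
    start = time.perf_counter()
    counts = await asyncio.gather(
        *(generate("Write a long story.") for _ in range(concurrency))
    )
    elapsed = time.perf_counter() - start
    print(f"aggregate: {sum(counts) / elapsed:.1f} tok/s across {concurrency} streams")

asyncio.run(main())
```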

submitted by /u/val_in_tech