AI Navigate

GLM 4.7 on dual RTX Pro 6000 Blackwell

Reddit r/LocalLLaMA / 3/16/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A Reddit post asks whether the full 358B GLM 4.7 model can fit entirely into 192GB of VRAM on dual RTX Pro 6000 Blackwell cards, and which quantization (including NVFP4) is feasible at batch size 1 with input lengths under 4096 tokens; a rough fit estimate is sketched after this list.
  • The author notes that online VRAM calculators may be conservative and seeks real-world results rather than theoretical estimates.
  • If the 192GB setup cannot fit the model, the post asks for alternative model recommendations suitable for the same hardware and use case (roleplay and general tool calling with RAG).
  • The thread provides a link to the Reddit post by user mircM52 for discussion and comments.
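
A quick back-of-envelope check of the fit question: total VRAM is roughly weights plus KV cache plus runtime overhead. The Python sketch below is an estimate under stated assumptions, not a measurement; the layer count, KV-head count, and head dimension are illustrative placeholders rather than GLM 4.7's published configuration, and treating NVFP4 as ~4.5 effective bits per weight (4-bit values plus an FP8 scale per 16-element block) is an approximation.

    # Minimal VRAM estimator: weights + KV cache + runtime overhead.
    # Model-shape numbers passed in below are illustrative placeholders,
    # NOT GLM 4.7's published configuration.

    def estimate_vram_gb(
        params_b: float,           # total parameters, in billions
        bits_per_weight: float,    # effective bits per weight after quantization
        n_layers: int,
        n_kv_heads: int,           # KV heads (GQA), not query heads
        head_dim: int,
        seq_len: int,
        kv_bytes: int = 2,         # FP16 KV cache
        overhead_gb: float = 5.0,  # CUDA context, activations, fragmentation
    ) -> float:
        weights_gb = params_b * bits_per_weight / 8  # 1e9 params * bits -> GB
        # KV cache at batch size 1: two tensors (K and V) per layer.
        kv_gb = 2 * n_layers * n_kv_heads * head_dim * seq_len * kv_bytes / 1e9
        return weights_gb + kv_gb + overhead_gb

    # NVFP4 stores 4-bit values plus an FP8 scale per 16-element block,
    # i.e. roughly 4 + 8/16 = 4.5 effective bits per weight.
    print(estimate_vram_gb(358, 4.5, n_layers=92, n_kv_heads=8,
                           head_dim=128, seq_len=4096))  # ~208 GB: over budget
    print(estimate_vram_gb(358, 4.0, n_layers=92, n_kv_heads=8,
                           head_dim=128, seq_len=4096))  # ~186 GB: tight fit

On these placeholder numbers, NVFP4 weights alone come to about 201GB, consistent with calculators saying the model "just barely doesn't fit" in 192GB, while a flat 4.0-bit quant squeaks under; in practice, embeddings and other layers kept at higher precision push the total up, which is why the author's request for real-world results is a reasonable one.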

Has anyone gotten this model (the full 358B version) to fit entirely into 192GB VRAM? If so, what's the highest quant that works (does NVFP4 fit)? Batch size 1, input sequence <4096 tokens. The theoretical calculators online say it just barely doesn't fit, but those tend to be conservative, so I wanted to know whether anyone has actually gotten this working in practice.

If it doesn't fit, does anyone have other model recommendations for this setup? Primary use case is roleplay (nothing NSFW) and general assistance (basic tool calling and RAG).

Apologies if this has been asked before; I can't seem to find it! And thanks in advance!

submitted by /u/mircM52