Lots of people on this subreddit are always asking if their system can run a certain model. The "VRAM calculators" I've found only provide very rough estimates, or are severely limited in the number of models they can estimate usage for. Both problems come from the complexity of figuring out how much memory is used by the numerous attention variants on the market today. The result is a tool that works for a few people but doesn't answer the question: "Can my 16GB GPU with 32GB of host RAM run this specific Q3 quant variant from unsloth or bartowski?"
I set out to build something that would stay regularly up-to-date and provide accurate estimates of whether, and how well, a model will run on a given system. Llama.cpp already has a fit algorithm for assigning layers/tensors to different devices, and it keeps getting better and more robust. So the answer is to run that fit algorithm directly in your browser to estimate whether a GGUF can run on the proposed system. An added benefit is that as llama.cpp supports newer models, the estimator gets them as well.
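For intuition, here's a very rough back-of-envelope sketch of the terms a fit estimate has to account for. This is NOT llama.cpp's actual algorithm (which assigns individual layers/tensors to specific devices and models backend buffer types); all the numbers and names below are illustrative assumptions.

```python
# Rough sketch of a total-memory estimate for a quantized model.
# NOT llama.cpp's real fit algorithm -- just the major terms it must
# account for. All parameter values here are illustrative assumptions.

def estimate_vram_gib(params_b, bits_per_weight, n_layers, n_kv_heads,
                      head_dim, ctx_len, kv_bytes=2, overhead_gib=0.75):
    """Very rough total memory needed to run a model, in GiB."""
    # Quantized weights: parameter count * bits per weight
    weights = params_b * 1e9 * bits_per_weight / 8
    # KV cache: 2 tensors (K and V) per layer, per context position
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bytes
    # Compute buffers, runtime context, etc. lumped into a flat overhead
    return (weights + kv_cache) / 1024**3 + overhead_gib

# e.g. a hypothetical 8B model at ~4.5 bits/weight (Q4_K_M-ish),
# 32 layers, 8 KV heads of dim 128, 8k context, fp16 KV cache
print(round(estimate_vram_gib(8, 4.5, 32, 8, 128, 8192), 1))
```

The hard part the real fit algorithm handles, and a formula like this can't, is deciding *which* layers and tensors land on which device when the model doesn't fit on one.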
App: https://acon96.github.io/vram.cpp/

Code: https://github.com/acon96/vram.cpp
There are still some weird behaviors in multi-GPU scenarios. In particular, it acts very strangely if you try to split a model across 2 GPUs AND host memory. MoE fitting is also a bit wonky, but I'm pretty sure that's on llama.cpp's side right now. I also still need to add some other backend variants so the correct buffer capabilities are exposed.
Hope this helps a few people get the right quant for their model without downloading 900GB of weights and spending a bunch of time running test fits.



