Lots of people on this subreddit are always asking if their system can run a certain model. The "VRAM calculators" I've found only provide very rough estimates, or are severely limited in the number of models they can estimate usage for. Both problems come from the complexity of figuring out how much memory is used by the numerous attention variants on the market today. The result is a tool that works for a few people but doesn't answer the question: "Can my 16GB GPU with 32GB of host RAM run this specific Q3 quant variant from unsloth or bartowski?"
I set out to build something that would stay regularly up-to-date and provide accurate estimates of whether, and how well, a model will run on a given system. Llama.cpp already has a fit algorithm for assigning layers/tensors to different devices, and it keeps getting better and more robust. So the answer is to run that fit algorithm directly in your browser to estimate whether a GGUF can run on the proposed system. An added benefit is that as llama.cpp supports newer models, the estimator gets them as well.
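For intuition, here's a very rough back-of-envelope sketch of the terms a fit estimate has to account for. This is NOT llama.cpp's actual algorithm (which assigns individual layers/tensors to specific devices and models backend buffer types); all the numbers and names below are illustrative assumptions.

```python
# Rough sketch of a total-memory estimate for a quantized model.
# NOT llama.cpp's real fit algorithm -- just the major terms it must
# account for. All parameter values here are illustrative assumptions.

def estimate_vram_gib(params_b, bits_per_weight, n_layers, n_kv_heads,
                      head_dim, ctx_len, kv_bytes=2, overhead_gib=0.75):
    """Very rough total memory needed to run a model, in GiB."""
    # Quantized weights: parameter count * bits per weight
    weights = params_b * 1e9 * bits_per_weight / 8
    # KV cache: 2 tensors (K and V) per layer, per context position
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bytes
    # Compute buffers, runtime context, etc. lumped into a flat overhead
    return (weights + kv_cache) / 1024**3 + overhead_gib

# e.g. a hypothetical 8B model at ~4.5 bits/weight (Q4_K_M-ish),
# 32 layers, 8 KV heads of dim 128, 8k context, fp16 KV cache
print(round(estimate_vram_gib(8, 4.5, 32, 8, 128, 8192), 1))
```

The hard part the real fit algorithm handles, and a formula like this can't, is deciding *which* layers and tensors land on which device when the model doesn't fit on one.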
App: https://acon96.github.io/vram.cpp/

Code: https://github.com/acon96/vram.cpp
There are still some weird behaviors in multi-GPU scenarios. In particular, it acts very strangely if you try to split a model across 2 GPUs AND host memory. MoE fitting is also a bit wonky, but I'm pretty sure that's on llama.cpp's side right now. I also still need to add some other backend variants so the correct buffer capabilities are exposed.
Hope this helps a few people get the right quant for their model without downloading 900GB of weights and spending a bunch of time running test fits.



