Recently I did a little performance test of several LLMs on PC with 16GB VRAM

Reddit r/LocalLLaMA / 4/4/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • A Reddit user benchmarks multiple LLMs (Qwen 3.5, Gemma-4, Nemotron Cascade 2, and GLM 4.7 flash) on a PC with an RTX 4080 and 16GB VRAM.
  • The test focuses on how inference speed degrades as the context length increases.
  • They run the models using llama.cpp and use quantization choices tailored to fit within the 16GB VRAM constraint.
  • A comparison result table is shared to help readers interpret relative performance across models and context sizes.

I tested Qwen 3.5, Gemma-4, Nemotron Cascade 2, and GLM 4.7 flash.

The goal was to see how performance (speed) degrades as the context grows.

I used llama.cpp with quants chosen to fit the 16GB VRAM of my RTX 4080.
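In case anyone wants to run a similar measurement themselves, below is a minimal sketch using llama-cpp-python (an assumption for illustration, not the exact harness behind the table below); the model path, filler text, and context sizes are placeholders.

```python
# A rough sketch (not the OP's exact setup) of timing generation at a few
# context sizes with llama-cpp-python. Model path and filler prompt are
# placeholders; pick a quant small enough to fit in 16 GB of VRAM.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-model-q4_k_m.gguf",  # hypothetical GGUF file
    n_ctx=16384,       # largest context window we plan to test
    n_gpu_layers=-1,   # offload every layer to the GPU
    verbose=False,
)

filler = "The quick brown fox jumps over the lazy dog. "

for target_ctx in (1024, 4096, 8192):
    # Build a prompt of roughly target_ctx tokens (the filler is ~10 tokens).
    prompt = filler * (target_ctx // 10)
    start = time.time()
    out = llm(prompt, max_tokens=128)
    elapsed = time.time() - start
    usage = out["usage"]
    print(f"{usage['prompt_tokens']} prompt tokens -> "
          f"{usage['completion_tokens']} generated in {elapsed:.1f}s "
          f"(prompt processing included)")
```

The elapsed time here includes prompt processing, which is where most of the slowdown at larger contexts tends to show up; llama.cpp's own llama-bench tool can also report prompt and generation throughput separately.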

Here is the resulting comparison table. I hope you find it useful.

https://preview.redd.it/ylafftgx76tg1.png?width=827&format=png&auto=webp&s=16d030952f1ea710cd3cef65b76e5ad2c3fd1cd3

submitted by /u/rosaccord