Recently I did a little performance test of several LLMs on PC with 16GB VRAM
Reddit r/LocalLLaMA / 4/4/2026
> Qwen 3.5, Gemma-4, Nemotron Cascade 2, and GLM 4.7 flash. Tested to see how performance (speed) degrades as the context increases. Used llama.cpp and quants better suited to the 16GB VRAM of my RTX 4080. Here is a comparison table of the results. Hope you find it useful.
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- A Reddit user benchmarks multiple LLMs (Qwen 3.5, Gemma-4, Nemotron Cascade 2, and GLM 4.7 flash) on a PC with an RTX 4080 and 16GB VRAM.
- The test focuses on how inference speed degrades as the context length increases.
- They run the models with llama.cpp, using quantizations chosen to fit within the 16GB VRAM constraint.
- A comparison result table is shared to help readers interpret relative performance across models and context sizes.
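For readers wondering whether a given quant will fit in a 16GB card before the context grows, a rough sizing rule is model weights plus KV cache. A minimal sketch of that arithmetic (the model dimensions and the ~4.5 bits/weight figure below are illustrative assumptions, and the estimate ignores quantization scale overhead and runtime buffers):

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    # A 4-bit quant stores roughly 4 bits per weight; slightly more
    # in practice due to per-block scales (ignored here).
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> float:
    # One K and one V tensor per layer, fp16 elements by default.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 8B-parameter model at ~4.5 bits/weight with a 32k context
w = weight_bytes(8e9, 4.5)
kv = kv_cache_bytes(n_layers=32, n_ctx=32768, n_kv_heads=8, head_dim=128)
total_gib = (w + kv) / 2**30
print(f"~{total_gib:.1f} GiB")  # compare against the 16 GiB budget
```

The KV cache term is why speed and memory pressure both worsen as context grows: it scales linearly with `n_ctx`, while the weight term is fixed.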
