Which model should I try?

Reddit r/LocalLLaMA / 5/3/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The author is looking for suggestions on which additional LLMs to try for a workflow that includes coding in Python/C++ and writing technical reports.
  • They currently use Qwen3.6 27B and Gemma4 31B, and previously tested Deepseek but found it too slow for their real-world usage.
  • They clarify they are not asking how to speed up models; instead, they want recommendations for other models that may better match their throughput constraints.
  • Their available hardware includes MI50 32GB and V100 32GB, and they consider responses below 10 tokens per second to be unacceptably slow.
  • They also indicate they already downscale via quantization or smaller variants when VRAM is insufficient, and they abandon models when latency is too high.

In my current workflow (coding in Python/C++ and technical reports) I mostly use Qwen3.6 27B and Gemma4 31B. In the past I tried other models like Deepseek with decent results, but it was painfully slow... so do you think there is some model I'm missing and should try?

EDIT: to be clear, I'm not asking how to make those models run faster; I'm asking which other models I should try. Telling me to try them all doesn't help: first, because there are a bazillion models available and nobody on earth could reasonably try them all; and second, if I were willing to try them all, I wouldn't have asked here. If a model needs more VRAM than I have available, I already scale down, either on the quantization or on the model size itself if possible, or I abandon the model because it's too slow.
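For context, the "scale down" step described above is usually a back-of-envelope calculation before downloading anything. Here is a minimal sketch in Python; the overhead factor and the quantization bit-widths are illustrative assumptions, not figures from the post:

```python
# Rough VRAM-fit check for a quantized model: weight memory is roughly
# params * bits / 8, plus an overhead factor for KV cache, activations,
# and runtime buffers. All numbers here are illustrative assumptions.

def fits_in_vram(params_b: float, bits_per_weight: float,
                 vram_gb: float, overhead: float = 1.2) -> bool:
    """Return True if the quantized weights (plus overhead) fit in VRAM.

    params_b        -- parameter count in billions
    bits_per_weight -- e.g. 16 (fp16), 8 (Q8), ~4.5 (Q4_K_M)
    vram_gb         -- available VRAM in gigabytes
    overhead        -- fudge factor for KV cache and buffers (assumption)
    """
    weight_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weight_gb * overhead <= vram_gb

# Example: a 27B model on a single 32 GB card at different quantizations.
for bits in (16, 8, 4.5):
    print(f"27B @ {bits}-bit fits in 32 GB: {fits_in_vram(27, bits, 32)}")
```

At fp16 a 27B model is well past 32 GB, while a ~4.5-bit quant fits comfortably, which matches the "scale down on the quantization" approach the author describes.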

System specs: MI50 32GB + V100 32GB. And anything below 10 tps in real-world usage is "painfully slow".
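A quick way to apply that 10 tps bar when auditioning a new model is to time a real completion. The sketch below is backend-agnostic; the `generate` callable is a hypothetical stand-in for whatever serving stack is in use, and only the timing logic is the point:

```python
import time
from typing import Callable, Tuple

def meets_tps_floor(generate: Callable[[str], Tuple[str, int]],
                    prompt: str, floor_tps: float = 10.0) -> bool:
    """Time one generation and compare decode throughput to a floor.

    generate -- hypothetical backend call returning
                (completion_text, completion_token_count)
    """
    start = time.perf_counter()
    _, n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    tps = n_tokens / elapsed if elapsed > 0 else 0.0
    print(f"{n_tokens} tokens in {elapsed:.1f}s -> {tps:.1f} tps")
    return tps >= floor_tps

# Usage with a dummy backend that fakes a slow model:
def slow_dummy(prompt: str) -> Tuple[str, int]:
    time.sleep(2.0)          # pretend decoding took 2 seconds
    return "example output", 15

if __name__ == "__main__":
    ok = meets_tps_floor(slow_dummy, "Write a Python function to parse a log file")
    print("keep model" if ok else "abandon model: below 10 tps")
```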

submitted by /u/WhatererBlah555