What a time to be alive: from 1 tk/sec to 20–100 tk/sec for huge models

Reddit r/LocalLLaMA / 5/4/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Industry & Market Moves

Key Points

  • The post argues that local inference speed has improved dramatically: hardware that once ran Llama 405B at about 1.2 tk/sec now reportedly runs much larger state-of-the-art models at roughly 30–100 tk/sec.
  • It cites examples of running big models (e.g., Kimi K2.6, DeepSeek V4 Flash, MiniMax 2.7, Step 3.5 Flash, Qwen3.5-397B) locally, claiming substantial performance gains over older models.
  • The author reflects on earlier skepticism about running slower models, saying they experimented to be prepared for very advanced AI/AGI scenarios.
  • The post also highlights that, for a few hundred dollars, users can run smaller models (e.g., Qwen3.6-36B) at high throughput (~50 tk/sec) at home.
  • Overall, it encourages local LLaMA enthusiasts to keep experimenting and dismiss critics, framing ongoing progress in local AI as evidence that the experiments are paying off.

https://www.reddit.com/r/LocalLLaMA/comments/1eb6to7/llama_405b_q4_k_m_quantization_running_locally/

https://www.reddit.com/r/LocalLLaMA/comments/1ebbgkr/llama_31_405b_q5_k_m_running_on_amd_epyc_9374f/

Llama 405B Q4 at 1.2 tk/sec two years ago was something to be excited about.

That same hardware will now run HUGE state-of-the-art models (Kimi K2.6, DeepSeek V4 Flash, MiniMax 2.7, Step 3.5 Flash, Qwen3.5-397B) at 30–100 tk/sec while crushing Llama 405B. :-/ The speedup isn't magic: these newer frontier models are sparse mixture-of-experts designs, so only a fraction of their weights is active per token, unlike dense Llama 405B.

I recall folks asking why anyone would want to run Llama 405B at 1.2 tk/sec, etc. My answer when folks asked me was that I wanted to be ready for when AGI arrived. If that meant running my own super AI at 1 tk/sec, I wanted that option. It turned out better than I could have ever imagined: we do have super AGI, and we can run it cheap and fast.

Putting aside the huge models, for a few hundred dollars you could run Qwen3.6-36B at ~50 tk/sec at home. So to my fellow local llama nuts: stay crazy, keep experimenting, ignore the naysayers. All the "stupid," "waste of time" experiments are paying off.
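For newcomers wondering how numbers like 50 tk/sec are actually measured, below is a minimal sketch using the llama-cpp-python bindings to time generation on a local GGUF model. The model filename, context size, and thread count are placeholder assumptions for illustration, not the poster's actual setup.

    # Rough throughput measurement with llama-cpp-python.
    # Assumption: the GGUF filename and settings below are placeholders,
    # not the poster's actual configuration.
    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="./qwen-36b-q4_k_m.gguf",  # hypothetical local quantized model
        n_ctx=4096,      # context window
        n_threads=16,    # match your CPU's physical cores
    )

    start = time.time()
    out = llm("Explain model quantization in one paragraph.", max_tokens=256)
    elapsed = time.time() - start

    # Note: elapsed includes prompt processing, so this slightly
    # understates pure generation speed.
    generated = out["usage"]["completion_tokens"]
    print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tk/sec")

For more rigorous comparisons, llama.cpp also ships a llama-bench tool that reports prompt-processing and token-generation speeds separately.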

submitted by /u/segmond