https://www.reddit.com/r/LocalLLaMA/comments/1eb6to7/llama_405b_q4_k_m_quantization_running_locally/
https://www.reddit.com/r/LocalLLaMA/comments/1ebbgkr/llama_31_405b_q5_k_m_running_on_amd_epyc_9374f/
Llama 405B Q4 at 1.2 tk/sec two years ago was something to be excited about.
That same hardware will now run HUGE state-of-the-art models (kimik2.6, deepseekv4flash, minimax2.7, step3.5flash, qwen3.5-397b) at 30-100 tk/sec while crushing Llama 405B. :-/
I recall folks asking why anyone would want to run Llama 405B at 1.2 tk/sec, etc. My answer when folks asked me was that I wanted to be ready for when AGI arrived. If that meant running my own super AI at 1 tk/sec, I wanted the option. It turned out better than I could ever have imagined: we do have super AGI, and we can run it cheap and fast.
Putting aside the huge models, for a few hundred dollars you can run qwen3.6-36b at 50 tk/sec at home (a rough sketch of what that looks like is below). So to my fellow local llama nuts: stay crazy, keep experimenting, ignore the naysayers. All the "stupid", "waste of time" experiments are paying off.
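
The post doesn't say which runtime is being used, so here is only a minimal sketch of what "running it at home" could look like, assuming a llama.cpp-style GGUF setup through the llama-cpp-python bindings; the model filename is a placeholder, not a real release.

```python
# Minimal sketch: load a quantized GGUF model locally with llama-cpp-python.
# The model path below is a placeholder -- point it at whatever quantized
# GGUF file you actually downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen-36b-q4_k_m.gguf",  # placeholder filename (assumption)
    n_ctx=4096,      # context window size
    n_threads=16,    # CPU threads; tune to your machine
)

out = llm("Explain in one sentence why local inference matters.", max_tokens=64)
print(out["choices"][0]["text"])
```
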




