I'm curious how much output token throughput would benefit from something smaller like a 16GB Tesla T4, offloading the remainder of the model to RAM.
I get about ~1.6 t/s output and ~20 t/s input CPU-only, which is obviously terrible. I'm using NUMA on dual Xeon Platinum CPUs, 24 cores each (so 48c/96t total), with 1.5TB of RAM.
Strangely enough, the Q8 model from Unsloth runs slightly faster than the Q4 model on my system.
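For the partial-offload idea, here's a rough sketch of how it might look, assuming the runtime is llama.cpp (the poster never says); the model path and the `-ngl` value are placeholders you'd tune to fit the T4's VRAM:

```shell
# Sketch of a llama.cpp run offloading part of the model to the GPU.
# Model filename and layer count are hypothetical placeholders.
./llama-cli -m ./model-Q4_K_M.gguf \
    -ngl 20 \
    --numa distribute \
    -t 48 \
    -p "Hello"
# -ngl 20           offload ~20 layers to the GPU; the rest stays in system RAM
# --numa distribute spread allocations evenly across both NUMA nodes
# -t 48             one thread per physical core often beats using all 96 SMT threads
```

Raise `-ngl` until VRAM is nearly full; layers left on the CPU still bottleneck on RAM bandwidth, so the speedup scales roughly with the fraction of layers offloaded.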




