Hi everyone!
Very exciting times we live in, where we can run models on laptops and GPUs that would've been SOTA four years ago.
I have been working with cloud models for years now, and I am now starting to dig into local models.
At work, I am leading a few different AI projects across the biz, and with our devs (who all love Claude and have seen real value from it), our biggest pain point at the moment is the usage limits.
So, I have started to have a play to see what the art of the possible is with local models. I have been keeping an eye on the space for a while, but Gemma 4 piqued my interest, and then luckily the new Qwen 3.6 model popped out too.
We run MBPs for dev teams at work (mine has 48GB memory), so I am able to run the new qwen3.6-35b-a3b model at around 50 tok/s, which is great. I'd be keen to hear from others how they are thinking about using these at work to bridge the gap when Claude limits cap out.
I also have a lot to learn about quant(?), and unsloth is a thing I keep seeing bandied around.
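(For anyone else at my stage: quant = quantization, storing the model weights in fewer bits so the whole thing fits in memory. A rough back-of-envelope sketch of why it matters on a 48GB machine, assuming a hypothetical 35B-total-parameter model; real quant formats add some overhead on top of this:)

```python
# Rough weight-memory footprint at common quantization levels.
# 35e9 is the assumed total parameter count (not just active experts).
params = 35e9

for name, bits in [("fp16", 16), ("q8", 8), ("q4", 4)]:
    gb = params * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{name}: ~{gb:.0f} GB")
```

So fp16 weights alone would blow way past 48GB, while a 4-bit quant leaves headroom for the KV cache and the rest of the system, which is why basically everyone runs these locally quantized.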

