Just some quick thoughts on Qwen3.5-35B-A3B-UD-IQ4_XS after I finally got it working in the new version of Ooba. In short: on a 3090, this thing runs at around 100 t/s with almost no prompt processing time, and it can fit something like 250k of context on the card with no cache quantization. Output quality is quite good too. My usual test is to make a quick demo and chuck it on CodePen, and until now I'd been trying and failing to get a local model to make a basic 3D snake game in Three.js.
This sort of thing should be easy, but lots of models couldn't make changes without breaking the entire thing, even when I reprompted them with a fresh context and as many pointers as I could easily provide. This model was different, though. It made a few mistakes, and it had to spend a while thinking at times, but it actually fixed shit and delivered a working product. I think the best you can hope for with a tiny model is strong competence at following directions and properly executing on a fairly well-defined goal, and this model seems to do that well. I have yet to try it out with Cline, but I suspect it will do fairly well in a proper agentic workflow. Cline is sort of a menace when it comes to hogging context, so it should pair well with a local model that is competent, really fast, and can fit a huge unquantized context on the GPU.
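For anyone wondering how much VRAM an unquantized context actually eats, here is a back-of-envelope sketch. It assumes plain attention layers that store fp16 keys and values for every token. The layer and head numbers are placeholders for illustration, not this model's real config, and `kv_cache_bytes` is a made-up helper name, so plug in the values from the model's own config before trusting the result.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Rough KV cache size for standard attention layers.

    The leading factor of 2 covers keys plus values.
    bytes_per_elem=2 corresponds to an unquantized fp16/bf16 cache.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Placeholder numbers (assumptions), NOT the actual Qwen3.5-35B-A3B config.
size = kv_cache_bytes(n_attn_layers=10, n_kv_heads=2, head_dim=256, context_len=250_000)
print(f"{size / 2**30:.2f} GiB")
```

The cache grows linearly with context length, so halving `bytes_per_elem` (roughly what 8-bit cache quantization does) halves the footprint, and whatever is left after the weights is what limits the context you can fit.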