So I recently updated LM Studio after a long pause and updated my llama.cpp runtimes too. I was shocked. I thought maybe something like turboquant had been enabled by default, but it turns out support for this model just got way better.
Step 3.5 Flash now slows down roughly 2.5x less as you load up the context, and uses about 1/4 of the memory for context!
On a mildly OC'd 5090 + RTX PRO 6000 over PCIe x8, I see this with IQ4_NL:
first prompt = 125 tokens/sec
170k context = 75 tokens/sec

Previously it was:

first prompt = 125 tokens/sec
96k context = 45 tokens/sec
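
For anyone who wants to sanity-check the "slows down ~2.5x less" claim, here's the back-of-the-envelope math on the numbers above (assuming throughput falls roughly linearly with loaded context, which is an approximation):

```python
# Throughput lost per 1k tokens of loaded context, before vs. after the update,
# using the benchmark numbers from this post.
old_drop_per_1k = (125 - 45) / 96    # ~0.83 tok/s lost per 1k ctx (old runtime, 96k ctx)
new_drop_per_1k = (125 - 75) / 170   # ~0.29 tok/s lost per 1k ctx (new runtime, 170k ctx)

print(f"old: {old_drop_per_1k:.2f} tok/s lost per 1k ctx")
print(f"new: {new_drop_per_1k:.2f} tok/s lost per 1k ctx")
print(f"context-induced slowdown is ~{old_drop_per_1k / new_drop_per_1k:.1f}x gentler now")
```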
With context memory now 4x cheaper, I can run Q4_K_L and still get up to 220k context, if I'm okay with ~10% less perf. Or I can set up parallel requests instead :)
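
To give a feel for why 4x cheaper context memory buys so much headroom, here's a generic KV-cache sizing sketch. The layer/head/dim numbers are made-up placeholders, not Step 3.5 Flash's actual architecture, and the 4x figure is just the savings I'm seeing, plugged into the standard formula:

```python
# Generic KV-cache sizing: K and V are cached per layer, per KV head, per head dim.
# Model dims below are HYPOTHETICAL, chosen only to illustrate the scaling.
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx / 1024**3

layers, kv_heads, hdim = 48, 8, 128   # placeholder architecture

before = kv_cache_gib(170_000, layers, kv_heads, hdim, bytes_per_elem=2.0)   # fp16-style cost
after  = kv_cache_gib(170_000, layers, kv_heads, hdim, bytes_per_elem=0.5)   # 4x cheaper per token

print(f"170k ctx, old cost: {before:.1f} GiB")
print(f"170k ctx, 1/4 cost: {after:.1f} GiB")
```

Same cache budget goes ~4x further, which is where the headroom for 220k context plus a heavier quant comes from.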
Step 3.5 Flash is now way more useful with agents, Cline, and other orchestrators that gobble up context.
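
And for the parallel-requests idea: a minimal sketch firing a few chat completions at LM Studio's OpenAI-compatible local server (default port 1234). The model id and prompts are placeholders, and you only get real concurrency if the server/runtime is configured with parallel slots:

```python
# Fire a few chat completions concurrently against LM Studio's local
# OpenAI-compatible endpoint. The model id below is a placeholder -- use
# whatever id LM Studio shows for the loaded model.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="step-3.5-flash",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

prompts = ["Summarize repo A", "Summarize repo B", "Summarize repo C"]
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])
```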




