FYI, Step 3.5 Flash has better perf and context is 1/4 the price in llama.cpp

Reddit r/LocalLLaMA / 4/14/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The Reddit post reports that updating llama.cpp/LM Studio revealed improved support for the Step 3.5 Flash model, with better performance as context length increases.
  • Step 3.5 Flash reportedly slows down about 2.5x less when loading large context and uses roughly one-quarter the memory for context compared with the prior setup.
  • The author provides benchmark examples showing higher token/sec at 170k context (75 token/sec) versus earlier performance at 96k context (45 token/sec), with the same first-prompt speed.
  • Because context memory is cheaper, the post claims users can run larger quantization variants (e.g., Q4_K_L) up to ~220k context with only about ~10% performance tradeoffs, or use parallel requests to recover throughput.
  • The author argues Step 3.5 Flash is now more practical for agent-style workflows and orchestrators (like Cline) that consume very large amounts of context.

So I recently updated LM Studio after a long pause and updated my llama.cpp runtimes too.. I was shocked.. I thought maybe something like turboquant was enabled by default.. but it just turns out this model's support got way better.

Step 3.5 Flash now slows down ~2.5x less as you load the context up, and uses 1/4 the memory for context!
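To see why a 4x cheaper context matters at these sizes, here is a generic KV-cache size estimate. This is not Step 3.5 Flash's actual architecture; the layer/head counts are placeholder values for illustration only.

```python
# Generic transformer KV-cache estimate. The architecture numbers below
# are PLACEHOLDERS, not Step 3.5 Flash's real config.
def kv_cache_gib(n_ctx, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V each store n_layers * n_kv_heads * head_dim values per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx / 1024**3

full = kv_cache_gib(170_000)      # roughly 31 GiB with these placeholder dims
reduced = full / 4                # a 4x cheaper cache quarters that footprint
```

With numbers in this ballpark, quartering the per-token cache cost is the difference between the KV cache crowding out the weights and having VRAM left over for a bigger quant.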

On a mildly OC'd 5090 + RTX PRO 6000 over x8, I see this with IQ4_NL:
first prompt = 125 token/sec
170k context = 75 token/sec

Previously it was:
first prompt = 125 token/sec
96k context = 45 token/sec
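For concreteness, the posted figures imply roughly these slowdown factors relative to the first prompt (my arithmetic on the author's numbers, nothing more):

```python
first_prompt = 125  # tokens/sec, same before and after the update

old_slowdown = first_prompt / 45  # 96k context, old runtime  -> ~2.78x slower
new_slowdown = first_prompt / 75  # 170k context, new runtime -> ~1.67x slower

print(f"old: {old_slowdown:.2f}x, new: {new_slowdown:.2f}x")
```

Note the contexts differ (96k vs 170k), so comparing the two ratios directly actually understates the improvement: the new runtime is less degraded even at nearly double the context depth.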

Due to this context memory being 4x cheaper, I can now run Q4_K_L and still get up to 220k context.. if I'm okay with ~10% less perf. Or I can set up parallel requests :)
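The parallel-requests option can be sketched with llama.cpp's own server; the model filename here is a placeholder, not an official artifact name. With `-np`, llama-server divides the `-c` context budget across the slots, so each of the 4 slots below gets ~55k of the 220k tokens.

```shell
# Config sketch: 4 parallel slots sharing a 220k context budget.
# Model filename is hypothetical; substitute your local GGUF.
llama-server -m step-3.5-flash-Q4_K_L.gguf -c 220000 -np 4 --port 8080
```

Whether one deep slot or several shallower slots wins depends on the workload: a single long-context agent session wants the former, a fleet of shorter concurrent requests recovers throughput with the latter.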

Step 3.5 Flash is now way more useful with agents, Cline, and other orchestrators that gobble up context.

submitted by /u/mr_zerolith