So I recently updated LM Studio after a long pause and updated my llama.cpp runtimes too. I was shocked. I thought maybe something like turboquant had been enabled by default, but it turns out support for this model just got way better.
Step 3.5 Flash now slows down roughly 2.5x less as you load up the context, and uses about 1/4 of the memory for context!
On a mildly OC'd 5090 + RTX PRO 6000 over PCIe x8, I see this with IQ4_NL:
first prompt = 125 tokens/sec
170k context = 75 tokens/sec

Previously it was:

first prompt = 125 tokens/sec
96k context = 45 tokens/sec
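
For anyone who wants to sanity-check the "slows down ~2.5x less" claim, here's the back-of-the-envelope math on the numbers above (assuming throughput falls roughly linearly with loaded context, which is an approximation):

```python
# Throughput lost per 1k tokens of loaded context, before vs. after the update,
# using the benchmark numbers from this post.
old_drop_per_1k = (125 - 45) / 96    # ~0.83 tok/s lost per 1k ctx (old runtime, 96k ctx)
new_drop_per_1k = (125 - 75) / 170   # ~0.29 tok/s lost per 1k ctx (new runtime, 170k ctx)

print(f"old: {old_drop_per_1k:.2f} tok/s lost per 1k ctx")
print(f"new: {new_drop_per_1k:.2f} tok/s lost per 1k ctx")
print(f"context-induced slowdown is ~{old_drop_per_1k / new_drop_per_1k:.1f}x gentler now")
```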
With context memory now 4x cheaper, I can run Q4_K_L and still get up to 220k context, if I'm okay with ~10% less perf. Or I can set up parallel requests instead :)
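
To give a feel for why 4x cheaper context memory buys so much headroom, here's a generic KV-cache sizing sketch. The layer/head/dim numbers are made-up placeholders, not Step 3.5 Flash's actual architecture, and the 4x figure is just the savings I'm seeing, plugged into the standard formula:

```python
# Generic KV-cache sizing: K and V are cached per layer, per KV head, per head dim.
# Model dims below are HYPOTHETICAL, chosen only to illustrate the scaling.
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx / 1024**3

layers, kv_heads, hdim = 48, 8, 128   # placeholder architecture

before = kv_cache_gib(170_000, layers, kv_heads, hdim, bytes_per_elem=2.0)   # fp16-style cost
after  = kv_cache_gib(170_000, layers, kv_heads, hdim, bytes_per_elem=0.5)   # 4x cheaper per token

print(f"170k ctx, old cost: {before:.1f} GiB")
print(f"170k ctx, 1/4 cost: {after:.1f} GiB")
```

Same cache budget goes ~4x further, which is where the headroom for 220k context plus a heavier quant comes from.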
Step 3.5 Flash is now way more useful with agents, Cline, and other orchestrators that gobble up context.
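
And for the parallel-requests idea: a minimal sketch firing a few chat completions at LM Studio's OpenAI-compatible local server (default port 1234). The model id and prompts are placeholders, and you only get real concurrency if the server/runtime is configured with parallel slots:

```python
# Fire a few chat completions concurrently against LM Studio's local
# OpenAI-compatible endpoint. The model id below is a placeholder -- use
# whatever id LM Studio shows for the loaded model.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="step-3.5-flash",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

prompts = ["Summarize repo A", "Summarize repo B", "Summarize repo C"]
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])
```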




