Experiences with DS4 on long-lived agents

Reddit r/LocalLLaMA / 4/24/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author reports that testing DeepSeek v4 Flash in their long-lived, tool-calling agent platform significantly improved reliability for background workloads.
  • Tool calling is described as sharper, with the model handling complex JSON schemas natively without producing odd markdown wrappers or omitting keys.
  • The model reportedly maintains conversational and task “thread” over long, high-context runs involving continuous web scraping, summarization, and storage in SQLite.
  • The author says v4 Flash is not only better but also cheaper than DeepSeek 3.2, and that they are considering replacing Gemini 3.1 Pro for some agent tasks with v4 Pro.

Holy cow, if you guys are running background agents or heavy tool-calling pipelines, you need to test the new Deepseek v4 flash model immediately.

For context, I maintain an open-source agent platform - basically a persistent daemon that handles background Python execution and SQLite state management. Because our agents run 24/7, sometimes making hundreds of tool calls an hour, API costs are usually our biggest bottleneck.

Up until yesterday, Deepseek 3.2 was our primary low-cost model. Insane price and comparable perf to SOTA models. But we just hot-swapped v4 flash into our routing, and it's kind of mind-blowing.
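To make "hot-swapped into our routing" concrete, here's a minimal sketch of tier-based model routing. This is not Gobii's actual code; the model IDs and tier names are assumptions for illustration:

```python
# Hypothetical model-routing table: cheap tier for high-volume background
# work, a pricier tier for tasks that need extra reasoning. Swapping models
# is then a one-line config change rather than a code change.
ROUTES = {
    "background": "deepseek-v4-flash",  # high-volume, low-cost workloads
    "complex": "deepseek-v4-pro",       # heavier reasoning tasks
}

def pick_model(task_tier: str) -> str:
    """Return the model ID for a task tier, falling back to the cheap tier."""
    return ROUTES.get(task_tier, ROUTES["background"])
```

Keeping the routing table as data means a hot swap is just updating the dict while the daemon keeps running.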

A couple things I'm noticing right away:

Tool calling is way sharper. It's nailing our complex JSON schemas natively without hallucinating weird markdown wrappers or dropping keys.
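For anyone who hasn't hit these failure modes: here's a hedged sketch of what "hallucinating weird markdown wrappers or dropping keys" looks like in practice, using one OpenAI-style function-calling tool definition. The tool name and fields are made up for illustration, not taken from Gobii:

```python
import json

# Hypothetical tool schema in the OpenAI-compatible function-calling format.
store_result_tool = {
    "type": "function",
    "function": {
        "name": "store_result",
        "description": "Persist a scraped summary to the agent's SQLite store.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string"},
                "summary": {"type": "string"},
                "fetched_at": {"type": "string", "format": "date-time"},
            },
            "required": ["url", "summary"],
        },
    },
}

def validate_call(args_json: str, schema: dict) -> dict:
    """Catch the two failure modes above: a ```json ... ``` markdown
    wrapper around the arguments, and missing required keys."""
    text = args_json.strip()
    if text.startswith("```"):
        # Strip the fence and an optional "json" language tag.
        text = text.strip("`").removeprefix("json").strip()
    args = json.loads(text)
    required = schema["function"]["parameters"]["required"]
    missing = [k for k in required if k not in args]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return args
```

A model that "nails the schema natively" returns bare JSON with every required key, so the wrapper-stripping branch never fires.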

Also, we do a ton of continuous context stuffing (scraping web data, summarizing it, stashing it in SQLite), and it just doesn't lose the thread even with high-context workloads. All this AND it's literally cheaper than 3.2.
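The "stash it in SQLite" step can be sketched like this. Table name and columns are my assumptions, and the summary text would come from a model call upstream:

```python
import sqlite3

def init_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the scrape table if it doesn't exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scraped ("
        "url TEXT PRIMARY KEY, summary TEXT, fetched_at TEXT)"
    )
    return conn

def stash(conn: sqlite3.Connection, url: str, summary: str, fetched_at: str) -> None:
    # Upsert so re-scraping the same URL refreshes the stored summary
    # instead of piling up duplicate rows in a 24/7 loop.
    conn.execute(
        "INSERT INTO scraped VALUES (?, ?, ?) "
        "ON CONFLICT(url) DO UPDATE SET summary=excluded.summary, "
        "fetched_at=excluded.fetched_at",
        (url, summary, fetched_at),
    )
    conn.commit()
```

The upsert matters for long-lived agents: the same URLs get rescraped constantly, and you want the latest summary, not an append-only log.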

We also use Gemini 3.1 Pro for our agents that need the extra smarts, but v4 Pro might replace that as well.

If anyone is curious about the architecture we're plugging this into, the open source repo is called Gobii. But honestly, I'm just here to validate the hype. We're making v4 flash + pro the default for our whole orchestration stack (pro for more complex workloads).

Anyone else benchmarking its JSON/tool-calling reliability yet? Curious if you're seeing the same bumps.

submitted by /u/ai-christianson