Step 3.7 Flash open weights dropped TODAY and the agent reliability numbers are actually interesting

Reddit r/artificial / 5/29/2026

Tags: Opinion, Signals & Early Trends, Tools & Practical Usage, Models & Research


Key Points

The post reports that open weights for “Step 3.7 Flash” were released the same day and highlights the release's claim of unusually strong agent reliability across difficulty levels.
It reports a tau2-bench score of 98% across all difficulty tiers, contrasting with typical patterns where models perform well on easy cases but degrade on hard ones.
For multi-step agent chains, the writer highlights the importance of not drifting mid-chain, suggesting this release is aimed more at reliability than raw frontier capability.
Reported capability figures include Toolathlon (49.5) and GDPval (45.8), positioning the model as a reliability-oriented option that may be ideal for some use cases but disappointing for others.
The model is described as a 198B sparse MoE with 11B active parameters, 400 TPS, 256K context, Apache 2.0 licensing, and local support on M4 Max and DGX Spark.

Read this release today. Some crazy numbers.

The tau2-bench number is 98% across all difficulty levels. That is the one that got me because usually these releases post a strong easy score and then quietly die at hard difficulty. This one... claims it holds.

For multi-step agent work, that actually matters more than most benchmarks. A model that drifts on step 4 of a 6-step chain is a debugging nightmare regardless of what its SWE score looks like.
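
To put rough numbers on why drift hurts, here is a back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability, which is a simplification and not something the release states; `chain_success` is a name made up for this illustration, and the per-step values are illustrative inputs, not measured figures for this model (the 98% tau2-bench score is a task-level number, not a per-step one).

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumption (not from the release): steps fail independently and
    share the same per-step success probability.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    # Illustrative per-step reliabilities for a 6-step chain.
    for p in (0.90, 0.95, 0.98):
        print(f"per-step {p:.2f} -> 6-step chain {chain_success(p, 6):.3f}")
```

Under those assumptions, 0.90 per step works out to roughly 53% for the full six steps, while 0.98 per step gives roughly 89%. Small per-step differences compound fast, which is why a flat score across difficulty tiers would matter if it holds up.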

Raw capability is mid: Toolathlon at 49.5, GDPval at 45.8. So this is clearly a reliability play, not a frontier capability play. Depending on your use case, that is either fine or a dealbreaker.

198B sparse MoE
11B active
400 TPS
256K context
Apache 2.0
runs locally on M4 Max and DGX Spark.
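
For a sense of what "runs locally" implies, here is a rough weight-size calculation from the two parameter counts above. The quantization levels (16, 8, and 4 bits per parameter) are assumptions for illustration; the release card as quoted here does not say which precision the local builds use, and the estimate ignores KV cache, activations, and quantization overhead. `weight_gb` is a hypothetical helper.

```python
TOTAL_PARAMS = 198e9   # total parameters, from the release card
ACTIVE_PARAMS = 11e9   # parameters active per token, from the release card


def weight_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Counts weights only; KV cache, activations, and quantization
    metadata are deliberately left out.
    """
    return params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
    # Assumed precisions, not confirmed by the release.
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: ~{weight_gb(TOTAL_PARAMS, bits):.0f} GB")
```

The usual caveat with sparse MoE applies: only about 11B parameters do work on each token, which helps throughput, but all 198B generally still have to sit in memory, so the total count is what decides whether it fits on a given machine.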

Has anyone actually put this through agent evals, or am I just reading the release card?

submitted by /u/Skid_gates_99

