Step 3.7 Flash open weights dropped TODAY and the agent reliability numbers are actually interesting

Reddit r/artificial / 5/29/2026

💬 Opinion | Signals & Early Trends | Tools & Practical Usage | Models & Research

Key Points

  • The release claims “Step 3.7 Flash” open weights were dropped recently and emphasizes unusually strong agent reliability across difficulty levels.
  • It reports a tau2-bench score of 98% across all difficulty tiers, contrasting with typical patterns where models perform well on easy cases but degrade on hard ones.
  • For multi-step agent chains, the writer highlights the importance of not drifting mid-chain, suggesting this release is aimed more at reliability than raw frontier capability.
  • Reported capability figures include Toolathlon (49.5) and GDPval (45.8), positioning the model as a reliability-oriented option that may be ideal for some use cases but disappointing for others.
  • The model is described as a 198B sparse MoE with 11B active parameters, 400 TPS, 256K context, Apache 2.0 licensing, and local support on M4 Max and DGX Spark.

Read this release today. Some crazy numbers.

The tau2-bench number is 98% across all difficulty levels. That is the one that got me because usually these releases post a strong easy score and then quietly die at hard difficulty. This one... claims it holds.

For multi-step agent work, that actually matters more than most benchmarks. A model that drifts on step 4 of a 6-step chain is a debugging nightmare regardless of what its SWE score looks like.
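To put rough numbers on why drift hurts, here is a back-of-the-envelope sketch. It assumes every step in a chain succeeds independently with the same probability, which real agent runs will not strictly satisfy. It also treats per-step reliability as a free parameter: the 98% tau2-bench figure is a benchmark score, not a measured per-step rate, so the 0.98 row is only illustrative. `chain_success` is a hypothetical helper name, not something from the release.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` succeed, ASSUMING independent steps
    that each succeed with probability `per_step`."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    # Illustrative per-step rates only; none of these come from the release.
    for p in (0.90, 0.95, 0.98):
        print(f"per-step {p:.2f} -> 6-step chain {chain_success(p, 6):.3f}")
```

Under that assumption, a model that is 90% reliable per step finishes a 6-step chain only about 53% of the time, while 98% per step gets you to roughly 89%. Small per-step gaps compound fast.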

Raw capability is mid: Toolathlon at 49.5, GDPval at 45.8. So this is clearly a reliability play, not a frontier capability play. Depending on your use case, that is either fine or a dealbreaker.

  • 198B sparse MoE
  • 11B active
  • 400 TPS
  • 256K context
  • Apache 2.0
  • Runs locally on M4 Max and DGX Spark
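For anyone sanity-checking the "runs locally" claim, a rough weights-only estimate is sketched below. The post does not say which quantization the local builds use, so the bit widths are assumptions, and so is the 128 GB memory budget (check your own hardware). The estimate ignores KV cache, activations, and runtime overhead. `weight_memory_gb` is a hypothetical helper. Note that with a sparse MoE, the full 198B parameters typically still have to be resident even though only 11B are active per token.

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Weights only: no KV cache, activations, or runtime overhead.
    """
    if total_params_billions < 0 or bits_per_weight <= 0:
        raise ValueError("parameters must be >= 0 and bits must be > 0")
    # billions of params * bits / 8 bits-per-byte = billions of bytes = GB
    return total_params_billions * bits_per_weight / 8


if __name__ == "__main__":
    ASSUMED_BUDGET_GB = 128  # assumption, not a figure from the post
    for bits in (16, 8, 4):  # assumed quantization levels
        gb = weight_memory_gb(198, bits)
        verdict = "fits" if gb < ASSUMED_BUDGET_GB else "does not fit"
        print(f"{bits:>2}-bit weights: ~{gb:.0f} GB ({verdict} in {ASSUMED_BUDGET_GB} GB)")
```

By this estimate, only something around 4-bit would squeeze under a 128 GB budget, and that is before counting a 256K-token KV cache.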

Has anyone actually put this through agent evals, or am I just reading the release card?

submitted by /u/Skid_gates_99