Terminal Bench score for Mistral 3.5 Medium

Reddit r/LocalLLaMA / 5/1/2026


Key Points

  • The author reports benchmarking Mistral 3.5 Medium on TBLite to estimate how well the model handles “agentic” and tool-using behaviors, since TerminalBench 2.0 (TB2) is too resource-intensive for them.
  • They explain that while TBLite scores don’t directly map to TB2, model-to-model trends should carry over, meaning stronger TBLite performance likely indicates stronger TB2 performance.
  • Using only a single run (so results may not be fully accurate), the author shares the measured scores and compares them with other models’ reported TBLite results plus SWEBench Verified scores.
  • The results suggest that, relative to prior Mistral models, Mistral 3.5 Medium shows a substantial improvement in agent/tool-calling capability, contrasting with earlier reports that smaller Mistral variants (e.g., Mistral Small 4) performed poorly.
  • Overall, the post frames Mistral 3.5 Medium as doing “really well for its size,” with an emphasis on better performance than previous Mistral offerings in agentic loop scenarios.

So... there were a couple of promising benchmark scores reported by mistralai in the model card for Mistral 3.5 Medium, BUT the one I usually care about the most, TerminalBench 2.0, wasn't among them. Since I was really curious how the new Mistral handles agentic stuff, I decided to benchmark it myself.

I didn't run TerminalBench 2.0, because I'm not crazy (usage would be biiiig), BUT I did run TBLite, which is a lighter/faster version of TerminalBench 2.0. Scores on this smaller variant don't map directly onto TB2 scores, but the ordering among models does carry over (if a model does better than another model on TBLite, it should also do better on TerminalBench 2.0).
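To make the "the trend carries over" idea concrete, here is a minimal sketch of how one could check it: count how many model pairs are ordered the same way by both benchmarks. The function name, the model names and every score below are made-up placeholders, not real TBLite or TB2 results.

```python
from itertools import combinations


def pairwise_order_agreement(lite, full):
    """Fraction of model pairs that both benchmarks rank in the same order.

    `lite` and `full` map model name -> score. Only models present in both
    dicts are compared; pairs tied on either benchmark are skipped.
    """
    models = sorted(set(lite) & set(full))
    agree = 0
    total = 0
    for a, b in combinations(models, 2):
        d_lite = lite[a] - lite[b]
        d_full = full[a] - full[b]
        if d_lite == 0 or d_full == 0:
            continue  # a tie says nothing about ordering
        total += 1
        if (d_lite > 0) == (d_full > 0):
            agree += 1
    return agree / total if total else float("nan")


# HYPOTHETICAL scores, for illustration only (not real benchmark results).
lite_scores = {"model_a": 40.0, "model_b": 25.0, "model_c": 10.0}
full_scores = {"model_a": 30.0, "model_b": 18.0, "model_c": 5.0}

agreement = pairwise_order_agreement(lite_scores, full_scores)
print(f"pairs ordered the same way: {agreement:.0%}")
```

The absolute numbers differ between the two dicts, yet every pair is ordered the same way, which is the sense in which a lighter benchmark can still be useful for ranking models.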

I only did one run, so it's likely not 100% accurate, but I decided to share the result here since maybe someone else is curious too, especially as Mistral Small 4 was... quite bad in terms of tool calling and agentic loops. Still... the result is below. I added a couple of other models that have a TBLite score reported in the benchmark card, plus SWEBench Verified scores for them and for GPT-5.4, Opus4.6 and GLM-5 (just for comparison). Tbh, for its size Mistral 3.5 Medium does really well, and most of all it's a big improvement compared with previous mistralai models. (Hurray, I really cheer for Mistral)

https://preview.redd.it/bgrl55b6ocyg1.png?width=1672&format=png&auto=webp&s=a3b9a87e4bce2b1b3cb7787c377c5387a7c0a67e
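On the single-run caveat: a rough way to see how much a one-run score can wobble is to treat each task as an independent pass/fail trial and compute a Wilson score interval for the pass rate. This is only a sketch under that assumption, and the task counts below are placeholders, not the numbers from this run. It also ignores run-to-run randomness in the model itself, so the real uncertainty may be larger.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# HYPOTHETICAL counts, for illustration only (not the result of this run).
passed_tasks = 30
total_tasks = 100

low, high = wilson_interval(passed_tasks, total_tasks)
print(f"pass rate {passed_tasks / total_tasks:.0%}, "
      f"approx. 95% interval {low:.1%} to {high:.1%}")
```

With these placeholder counts the interval spans roughly 22% to 40%, so small gaps between models on a single run shouldn't be over-interpreted.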

submitted by /u/Real_Ebb_7417