So... there were a couple of promising benchmark scores reported by mistralai in the model card for Mistral 3.5 Medium, BUT not the one I usually care about most, which is TerminalBench 2.0. So... since I was really curious how the new Mistral handles agentic stuff, I decided to benchmark it myself. I didn't run TerminalBench 2.0, because I'm not crazy (usage would be biiiig), BUT I did run TBLite, which is a lighter/faster version of TerminalBench 2.0. Scores on this smaller variant don't map directly to TB2 scores, but the trend among models does (if a model does better than another model on TBLite, it will also do better on TerminalBench 2.0). I only did one run, so it's likely not 100% accurate, but I decided to share the result here in case someone else is curious, especially since Mistral Small 4 was... quite bad in terms of tool calling and agentic loops. Still... the result is below. I added a couple of other models that have a TBLite score reported in the benchmark card + added SWEBench Verified scores for them and for GPT-5.4, Opus 4.6 and GLM-5 (just for comparison). Tbh, for its size Mistral 3.5 Medium does really well and, most of all, is a big improvement over previous mistralai models. (Hurray, I really cheer for Mistral)
Terminal Bench score for Mistral 3.5 Medium
Reddit r/LocalLLaMA / 5/1/2026
💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Models & Research
Key Points
- The author reports benchmarking Mistral 3.5 Medium on TBLite to estimate how well the model handles “agentic” and tool-using behaviors, since TerminalBench 2.0 (TB2) is too resource-intensive for them.
- They explain that while TBLite scores don’t directly map to TB2, model-to-model trends should carry over, meaning stronger TBLite performance likely indicates stronger TB2 performance.
- Using only a single run (so results may not be fully accurate), the author shares the measured scores and compares them with other models’ reported TBLite results plus SWEBench Verified scores.
- The results suggest that, relative to prior Mistral models, Mistral 3.5 Medium shows a substantial improvement in agent/tool-calling capability, contrasting with earlier reports that smaller Mistral variants (e.g., Mistral Small 4) performed poorly.
- Overall, the post frames Mistral 3.5 Medium as doing “really well for its size,” with an emphasis on better performance than previous Mistral offerings in agentic loop scenarios.