I wanted to know whether GLM is another benchmark-optimized model or actually useful in agents like OpenClaw, so I tested GLM 5.1 in our agentic benchmark. It turns out it reaches Opus 4.6-level performance at just 1/3 of the cost (~$0.4 per run vs ~$1.2 per run) based on my tests, and it outperforms all other models tested, pushing the cost-effectiveness frontier quite a bit. I don't quite trust static benchmarks; I've seen many models optimized for them, ranking high on those leaderboards but not working well in real agentic tasks. So we use OpenClaw to test the agentic performance of models in a real environment with real, user-submitted tasks: Chatbot Arena/LMArena-style battles with an LLM as judge. Based on the results, I would say GLM 5.1 is one of the top models for OpenClaw-type agents now. Qwen 3.6 also did a good job, but it does not yet support prompt caching (on OpenRouter), so its current price is inflated. With prompt caching I expect it to reach MiniMax M2.7-level cost per run and become another great choice for cost effectiveness. The full leaderboard, cost-effectiveness analysis, and methodology can be found at https://app.uniclaw.ai/arena?via=reddit . I strongly recommend submitting your own task and seeing how different models perform on it. [Edit 1] It seems many people confused price per token with price per task. GLM 5.1's price per token is < 1/5 of Opus's, but GLM also uses about 2x the tokens per task compared to Opus on the same tasks, based on our benchmark. The reason is that GLM uses tools aggressively, making more than 2x the tool calls per task compared to Opus. That's why the actual cost per task is about 1/3 of Opus's.
GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost
Reddit r/LocalLLaMA / 4/11/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- A user reports testing GLM 5.1 in a real-world agentic benchmark (OpenClaw-style battles using an LLM-as-judge) and finds it matches Opus-level performance (around Opus 4.6) while costing about one-third as much per run.
- The report claims GLM 5.1 outperforms all other models tested in the same evaluation, suggesting a meaningful shift in agent-focused cost-effectiveness.
- The user emphasizes that static leaderboard benchmarks may mislead and highlights that tool-usage behavior matters, noting GLM uses roughly 2x the tokens per task compared to Opus due to more aggressive tool calling.
- They clarify the “one-third of Opus cost” finding is based on cost per task/run rather than token price, explaining that extra tokens are offset by cheaper token rates.
- The post also notes Qwen 3.6 as a strong alternative but says lack of prompt caching (on OpenRouter) inflates its effective cost, implying better cost competitiveness if caching becomes available.
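The token-price vs task-cost distinction above reduces to simple arithmetic: cost per task = price per token × tokens consumed per task. The sketch below illustrates it with hypothetical figures chosen only to match the ratios the post reports (GLM token price well under 1/5 of Opus's, but ~2x tokens per task); the dollar amounts and per-million-token prices are assumptions, not actual vendor pricing.

```python
# Illustrative cost-per-task arithmetic. All numbers are assumptions picked
# to reproduce the post's reported ratios (~$1.2 vs ~$0.4 per run), not
# real model pricing.

def cost_per_task(price_per_mtok: float, tokens_per_task: int) -> float:
    """Cost of one agentic run: price per million tokens x tokens used."""
    return price_per_mtok * tokens_per_task / 1_000_000

# Hypothetical: an Opus-like model, and a GLM-like model priced at 1/6 the
# token rate but burning ~2x the tokens per task via extra tool calls.
opus = cost_per_task(price_per_mtok=15.0, tokens_per_task=80_000)   # $1.20
glm = cost_per_task(price_per_mtok=2.5, tokens_per_task=160_000)    # $0.40

print(f"Opus ~${opus:.2f}/task, GLM ~${glm:.2f}/task, ratio {glm / opus:.2f}")
```

The point: a cheaper token rate can be partially eaten by higher token usage, so per-task cost, not per-token price, is the number that matters for agent workloads.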

