I wanted to know whether GLM is another benchmark-optimized model or actually useful in agents like OpenClaw, so I tested GLM 5.1 in our agentic benchmark. It turns out it reaches Opus 4.6-level performance at just 1/3 of the cost (~$0.4 per run vs ~$1.2 per run) based on my tests, and it outperforms all other models tested, pushing the cost-effectiveness frontier quite a bit. I don't quite trust static benchmarks; I've seen many models optimized for them, ranking high on those leaderboards but not working well in real agentic tasks. So we use OpenClaw to test the agentic performance of models in a real environment with real, user-submitted tasks: Chatbot Arena/LMArena-style battles with an LLM as judge. Based on the results, I would say GLM 5.1 is one of the top models for OpenClaw-type agents now. Qwen 3.6 also did a good job, but it does not yet support prompt caching (on OpenRouter), so its current price is inflated. With prompt caching I expect it to reach MiniMax M2.7-level cost per run and become another great choice for cost effectiveness. The full leaderboard, cost-effectiveness analysis, and methodology can be found at https://app.uniclaw.ai/arena?via=reddit . I strongly recommend submitting your own task and seeing how different models perform on it. [Edit 1] It seems many people confused price per token with price per task. GLM 5.1's price per token is < 1/5 of Opus's, but GLM also uses about 2x the tokens per task compared to Opus on the same tasks, based on our benchmark. The reason is that GLM uses tools aggressively, making more than 2x the tool calls per task compared to Opus. That's why the actual cost per task is about 1/3 of Opus's.
GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost
Reddit r/LocalLLaMA / 4/11/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- A user reports testing GLM 5.1 in a real-world agentic benchmark (OpenClaw-style battles using an LLM-as-judge) and finds it matches Opus-level performance (around Opus 4.6) while costing about one-third as much per run.
- The report claims GLM 5.1 outperforms all other models tested in the same evaluation, suggesting a meaningful shift in agent-focused cost-effectiveness.
- The user emphasizes that static leaderboard benchmarks may mislead and highlights that tool-usage behavior matters, noting GLM uses roughly 2x the tokens per task compared to Opus due to more aggressive tool calling.
- They clarify the “one-third of Opus cost” finding is based on cost per task/run rather than token price, explaining that extra tokens are offset by cheaper token rates.
- The post also notes Qwen 3.6 as a strong alternative but says lack of prompt caching (on OpenRouter) inflates its effective cost, implying better cost competitiveness if caching becomes available.
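The token-price vs task-cost distinction above reduces to simple arithmetic: cost per task = price per token × tokens consumed per task. The sketch below illustrates it with hypothetical figures chosen only to match the ratios the post reports (GLM token price well under 1/5 of Opus's, but ~2x tokens per task); the dollar amounts and per-million-token prices are assumptions, not actual vendor pricing.

```python
# Illustrative cost-per-task arithmetic. All numbers are assumptions picked
# to reproduce the post's reported ratios (~$1.2 vs ~$0.4 per run), not
# real model pricing.

def cost_per_task(price_per_mtok: float, tokens_per_task: int) -> float:
    """Cost of one agentic run: price per million tokens x tokens used."""
    return price_per_mtok * tokens_per_task / 1_000_000

# Hypothetical: an Opus-like model, and a GLM-like model priced at 1/6 the
# token rate but burning ~2x the tokens per task via extra tool calls.
opus = cost_per_task(price_per_mtok=15.0, tokens_per_task=80_000)   # $1.20
glm = cost_per_task(price_per_mtok=2.5, tokens_per_task=160_000)    # $0.40

print(f"Opus ~${opus:.2f}/task, GLM ~${glm:.2f}/task, ratio {glm / opus:.2f}")
```

The point: a cheaper token rate can be partially eaten by higher token usage, so per-task cost, not per-token price, is the number that matters for agent workloads.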

