Local LLM Benchmark about Backend Generation by Function Calling (GLM vs Qwen vs DeepSeek)

Reddit r/LocalLLaMA / 5/3/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The post presents a controlled local-LLM benchmark for backend code generation via function calling, comparing models including GLM, Qwen, and DeepSeek under a structured scoring rubric.
  • It reports that the function-calling harness has substantially narrowed the performance gap between frontier and local models on backend generation, with notable equivalences (e.g., GPT-5.4 vs Qwen3.5-35b-a3b, and Claude Sonnet vs a smaller Qwen).
  • The benchmark will stop including frontier models in the next iteration due to cost constraints, switching instead to cheaper OpenRouter endpoints or models runnable on a 64GB unified-memory laptop.
  • The next rounds will incorporate frontend automation alongside backend testing, with the expectation that an AutoBe-emitted SDK will be sufficient to generate an end-to-end working frontend.
  • Some counterintuitive ranking results remain under investigation, including cases where a flagship model underperforms its mini variant and where Qwen dense 27B outperforms larger MoE variants within the same family.

Detailed Article: https://autobe.dev/articles/local-llm-benchmark-about-backend-generation.html


Five months ago I posted the "Hardcore function calling benchmark in backend coding agent" thread here. As I wrote in that post, it was an uncontrolled measurement: useful for showing whether each model could fill our complex recursive-union AST schemas at all (a minimal sketch of what that means below), but not a benchmark in any rigorous sense.

This post is the proper version, with controlled variables and a real scoring rubric.
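For context on "recursive-union AST schemas": the harness asks the model to fill a function-calling tool whose parameter schema is a union of node types that refer back to the union itself. A heavily simplified TypeScript sketch with hypothetical names (the real AutoBe schemas are far larger and stricter):

```typescript
// Hypothetical, heavily simplified illustration of a recursive-union
// tool-parameter schema: a union whose members contain the union itself.
type IExpression = IColumnRef | ILiteral | IBinaryOp;

interface IColumnRef {
  type: "columnRef"; // discriminator tag
  table: string;
  column: string;
}
interface ILiteral {
  type: "literal";
  value: string | number | boolean | null;
}
interface IBinaryOp {
  type: "binaryOp";
  operator: "and" | "or" | "eq" | "lt" | "gt";
  left: IExpression; // recursion: operands are expressions themselves
  right: IExpression;
}
```

Filling a tree of these correctly via function calling is what the harness measures.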

Three findings worth sharing

  1. The function calling harness has effectively closed the frontier-vs-local gap on backend generation. gpt-5.4's DB/API design ≈ qwen3.5-35b-a3b's. claude-sonnet-4.6's logic ≈ qwen3.5-27b's.

  2. This is the last round that includes frontier models. Running them every month is genuinely too expensive for an open-source project: one shopping-mall run is ~200–300M tokens, which works out to ~$1,000–$1,500 per model at GPT 5.5 pricing (quick math after this list). From next month, the comparison set is limited to OpenRouter endpoints under $0.25/M, or models that fit on a 64GB unified-memory laptop.

  3. Frontend automation joins the benchmark in two or three months. The SDK that AutoBe already emits is enough to drive a working AI-built frontend end-to-end (the visuals are rough, but every function works; illustrative sketch below). The June/July round will cover backend + auto-generated frontend together.
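The quick math behind finding 2's per-model figure, assuming a blended rate of roughly $5 per million tokens (my back-of-envelope reading of the quoted GPT 5.5 pricing, not an official number):

```typescript
// Midpoint of the quoted 200-300M token range for one shopping-mall run.
const tokensPerRun = 250_000_000;
// Assumed blended $/M-token rate; real input/output pricing differs.
const usdPerMillion = 5;
const costPerModel = (tokensPerRun / 1_000_000) * usdPerMillion;
console.log(`~$${costPerModel}`); // ~$1250, inside the quoted $1,000-$1,500
```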
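And a purely illustrative sketch of what "the SDK is enough to drive a frontend" means: the generated frontend calls typed SDK functions instead of hand-written fetch code. Every name below is hypothetical, not AutoBe's actual output:

```typescript
// Hypothetical emitted SDK package and function path; AutoBe's real
// output will differ. The point: every backend endpoint arrives as a
// typed function the frontend can call directly. (Node 18+ ES module.)
import api from "@example/shopping-api"; // hypothetical package name

const connection = { host: "http://localhost:37001" };
// Request/response DTOs are typed by the generated SDK itself.
const page = await api.functional.shoppings.sales.index(connection, {
  page: 1,
  limit: 10,
});
console.log(page.data.length);
```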

Three inversions, still investigating

A few results I'm honestly not sure how to read yet:

  • openai/gpt-5.4 actually scores below its own mini sibling.
  • deepseek-v4-pro lands one notch below qwen3.5-35b-a3b, and barely separates from its own Flash sibling.
  • Within the Qwen family, dense 27B beats every MoE variant — even 397B-A17B.

Two readings I want to investigate before claiming anything:

  1. A CoT-compliance effect: bigger, more frontier-tier models tend to skip procedural instructions, which our harness enforces strictly.
  2. Benchmark defects: only n=4 reference projects, a narrow score band, and our own harness scoring our own pipeline.

I'll report back in a future round once we've dug more.

Recommendations welcome

Three candidates we're locked in on so far:

  • openai/gpt-5.4-nano — $0.25/M
  • qwen/qwen3.6-27b — $0.195/M
  • deepseek/deepseek-v4-flash — $0.14/M

If you know other small models that meet either condition (under $0.25/M on OpenRouter, or runnable on a 64GB unified-memory laptop) and handle function calling cleanly, please drop a comment.

r/LocalLLaMA tends to spot these faster than we do, and recommendations from this thread will fill out a big chunk of next month's comparison set.
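If you want to scan for candidates yourself, OpenRouter publishes its model list and pricing over a public endpoint. A rough filter sketch; this assumes the current response shape, where pricing fields are strings priced in USD per single token, so double-check before relying on it:

```typescript
// Node 18+ ES module (top-level await). Multiply per-token prices by 1e6
// to compare against a $/M-token budget like the $0.25/M cutoff above.
const res = await fetch("https://openrouter.ai/api/v1/models");
const { data } = (await res.json()) as {
  data: { id: string; pricing: { prompt: string; completion: string } }[];
};
const cheap = data
  .filter((m) => Number(m.pricing.prompt) * 1_000_000 <= 0.25)
  .map((m) => m.id);
console.log(cheap);
```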

submitted by /u/jhnam88