Local LLM Benchmark about Backend Generation by Function Calling (GLM vs Qwen vs DeepSeek)

Reddit r/LocalLLaMA / 5/3/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The post presents a controlled local-LLM benchmark for backend code generation via function calling, comparing models including GLM, Qwen, and DeepSeek under a structured scoring rubric.
  • It reports that the function-calling harness has substantially narrowed the performance gap between frontier and local models on backend generation, with notable equivalences (e.g., GPT-5.4 vs Qwen3.5-35b-a3b, and Claude Sonnet vs a smaller Qwen).
  • The benchmark will stop including frontier models in the next iteration due to cost constraints, switching instead to cheaper OpenRouter endpoints or models runnable on a 64GB unified-memory laptop.
  • The next rounds will incorporate frontend automation alongside backend testing, with the expectation that an AutoBe-emitted SDK will be sufficient to generate an end-to-end working frontend.
  • Some counterintuitive ranking results remain under investigation, including cases where a flagship model underperforms its mini variant and where Qwen dense 27B outperforms larger MoE variants within the same family.

Detailed Article: https://autobe.dev/articles/local-llm-benchmark-about-backend-generation.html


Five months ago I posted the "Hardcore function calling benchmark in backend coding agent" thread here. As I wrote in that post, it was an uncontrolled measurement: useful for showing whether each model could fill our complex recursive-union AST schemas at all (a minimal sketch of what that means below), but not a benchmark in any rigorous sense.

This post is the proper version, with controlled variables and a real scoring rubric.
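For context on "recursive-union AST schemas": the harness asks the model to fill a function-calling tool whose parameter schema is a union of node types that refer back to the union itself. A heavily simplified TypeScript sketch with hypothetical names (the real AutoBe schemas are far larger and stricter):

```typescript
// Hypothetical, heavily simplified illustration of a recursive-union
// tool-parameter schema: a union whose members contain the union itself.
type IExpression = IColumnRef | ILiteral | IBinaryOp;

interface IColumnRef {
  type: "columnRef"; // discriminator tag
  table: string;
  column: string;
}
interface ILiteral {
  type: "literal";
  value: string | number | boolean | null;
}
interface IBinaryOp {
  type: "binaryOp";
  operator: "and" | "or" | "eq" | "lt" | "gt";
  left: IExpression; // recursion: operands are expressions themselves
  right: IExpression;
}
```

Filling a tree of these correctly via function calling is what the harness measures.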

Three findings worth sharing

  1. The function calling harness has effectively closed the frontier-vs-local gap on backend generation. gpt-5.4's DB/API design ≈ qwen3.5-35b-a3b's. claude-sonnet-4.6's logic ≈ qwen3.5-27b's.

  2. This is the last round that includes frontier models. Running them every month is genuinely too expensive for an open-source project: one shopping-mall run is ~200–300M tokens, which works out to ~$1,000–$1,500 per model at GPT 5.5 pricing (quick math after this list). From next month, the comparison set is limited to OpenRouter endpoints under $0.25/M, or models that fit on a 64GB unified-memory laptop.

  3. Frontend automation joins the benchmark in two or three months. The SDK that AutoBe already emits is enough to drive a working AI-built frontend end-to-end (the visuals are rough, but every function works; illustrative sketch below). The June/July round will cover backend + auto-generated frontend together.
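The quick math behind finding 2's per-model figure, assuming a blended rate of roughly $5 per million tokens (my back-of-envelope reading of the quoted GPT 5.5 pricing, not an official number):

```typescript
// Midpoint of the quoted 200-300M token range for one shopping-mall run.
const tokensPerRun = 250_000_000;
// Assumed blended $/M-token rate; real input/output pricing differs.
const usdPerMillion = 5;
const costPerModel = (tokensPerRun / 1_000_000) * usdPerMillion;
console.log(`~$${costPerModel}`); // ~$1250, inside the quoted $1,000-$1,500
```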
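And a purely illustrative sketch of what "the SDK is enough to drive a frontend" means: the generated frontend calls typed SDK functions instead of hand-written fetch code. Every name below is hypothetical, not AutoBe's actual output:

```typescript
// Hypothetical emitted SDK package and function path; AutoBe's real
// output will differ. The point: every backend endpoint arrives as a
// typed function the frontend can call directly. (Node 18+ ES module.)
import api from "@example/shopping-api"; // hypothetical package name

const connection = { host: "http://localhost:37001" };
// Request/response DTOs are typed by the generated SDK itself.
const page = await api.functional.shoppings.sales.index(connection, {
  page: 1,
  limit: 10,
});
console.log(page.data.length);
```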

Three inversions, still investigating

A few results I'm honestly not sure how to read yet:

  • openai/gpt-5.4 actually scores below its own mini sibling.
  • deepseek-v4-pro lands one notch below qwen3.5-35b-a3b, and barely separates from its own Flash sibling.
  • Within the Qwen family, dense 27B beats every MoE variant — even 397B-A17B.

Two readings I want to investigate before claiming anything:

  1. A CoT-compliance effect: bigger, more frontier-tier models tend to skip procedural instructions, which our harness enforces strictly.
  2. Benchmark defects: only n=4 reference projects, a narrow score band, and our own harness scoring our own pipeline.

I'll report back in a future round once we've dug more.

Recommendations welcome

Three candidates we're locked in on so far:

  • openai/gpt-5.4-nano — $0.25/M
  • qwen/qwen3.6-27b — $0.195/M
  • deepseek/deepseek-v4-flash — $0.14/M

If you know other small models that meet either condition (under $0.25/M on OpenRouter, or runnable on a 64GB unified-memory laptop) and handle function calling cleanly, please drop a comment.

r/LocalLLaMA tends to spot these faster than we do, and recommendations from this thread will fill out a big chunk of next month's comparison set.
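If you want to scan for candidates yourself, OpenRouter publishes its model list and pricing over a public endpoint. A rough filter sketch; this assumes the current response shape, where pricing fields are strings priced in USD per single token, so double-check before relying on it:

```typescript
// Node 18+ ES module (top-level await). Multiply per-token prices by 1e6
// to compare against a $/M-token budget like the $0.25/M cutoff above.
const res = await fetch("https://openrouter.ai/api/v1/models");
const { data } = (await res.json()) as {
  data: { id: string; pricing: { prompt: string; completion: string } }[];
};
const cheap = data
  .filter((m) => Number(m.pricing.prompt) * 1_000_000 <= 0.25)
  .map((m) => m.id);
console.log(cheap);
```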

submitted by /u/jhnam88