AutoBe benchmark: structured harness narrows frontier-vs-local gap in backend generation [D]

Reddit r/MachineLearning / 5/4/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Models & Research

Key Points

  • AutoBe is an end-to-end backend generation benchmark where one natural-language request yields requirements, ERD, an OpenAPI spec, E2E tests, a NestJS implementation, and a type-safe SDK.
  • The benchmark uses a structured AST-filling approach via function calling, and scores are based entirely on static analysis for consistent results across reruns.
  • Model scores cluster tightly: GLM 5 leads, qwen3.5-27b is close behind, and several local models achieve enterprise-scale backends with 100% compile success.
  • The authors argue that with a sufficiently structured harness, backend-generation quality is constrained more by the benchmark harness design than by raw model prestige.
  • Cost is a major factor: frontier-priced runs cost about $1,000–$1,500 per model, and the next round plans to focus on cheaper or laptop-runnable models.
  • The results are limited by using only four reference projects and may favor models that follow function-calling procedures well.

AutoBe is a benchmark for end-to-end backend generation. One natural-language request produces six outputs: requirements analysis, ERD, OpenAPI spec, E2E tests, NestJS implementation, and a type-safe SDK. Each phase fills a predefined AST via structured function calling rather than generating unstructured code. The scoring rubric totals 100 points and is driven entirely by static analysis, so the same artifact scores the same regardless of who reruns it.
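To make the "fill a predefined AST via function calling" idea concrete, here is a minimal hypothetical sketch, not AutoBe's actual schema: the node shape, the tool name `fill_endpoint_node`, and the scoring checks are all invented for illustration. The point is that the model can only return arguments matching a fixed JSON Schema, and scoring is a deterministic static check over the filled node.

```typescript
// Hypothetical sketch only — not AutoBe's real schema or scorer.

// A predefined AST node the model must populate for one API endpoint.
interface EndpointNode {
  method: "GET" | "POST" | "PUT" | "DELETE";
  path: string;                  // e.g. "/articles/{id}"
  summary: string;
  requestSchemaRef?: string;     // reference into a shared DTO table
  responseSchemaRef: string;
  errorCodes: number[];
}

// The tool definition handed to the model: it can only emit arguments that
// satisfy this JSON Schema, so the harness receives structured data, not prose.
const fillEndpointTool = {
  name: "fill_endpoint_node",
  description: "Populate one endpoint node of the OpenAPI AST.",
  parameters: {
    type: "object",
    properties: {
      method: { enum: ["GET", "POST", "PUT", "DELETE"] },
      path: { type: "string", pattern: "^/" },
      summary: { type: "string", minLength: 1 },
      requestSchemaRef: { type: "string" },
      responseSchemaRef: { type: "string" },
      errorCodes: { type: "array", items: { type: "integer" } },
    },
    required: ["method", "path", "summary", "responseSchemaRef", "errorCodes"],
  },
} as const;

// A purely static check the scorer could run on the filled node: no model
// re-invocation is involved, so reruns of the same artifact score identically.
function scoreEndpoint(node: EndpointNode): number {
  let score = 0;
  if (node.path.startsWith("/")) score += 1;                          // well-formed path
  if (node.summary.trim().length > 0) score += 1;                     // documented
  if (node.errorCodes.every((c) => c >= 400 && c < 600)) score += 1;  // sane error codes
  return score;
}
```

This is the structural reason the harness matters so much: once generation is reduced to filling typed slots and scoring is reduced to static checks over those slots, a lot of the usual free-form codegen variance disappears.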

The headline finding is that scores cluster tightly. GLM 5 tops the benchmark run. qwen3.5-27b sits directly behind frontier models. Several local models produced enterprise-scale backends with 100% compile success. The author's interpretation: once the harness is structured, backend-generation quality is constrained more by harness design than by model prestige.

The cost contrast is significant. A full benchmark run at frontier pricing ($5/M input tokens) costs $1,000-$1,500 per model. The next benchmark round plans to filter to models at $0.25/M input or runnable on a 64GB unified-memory laptop, which would include most of the models that clustered near the top anyway.
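For scale, a rough back-of-envelope (assuming the quoted figures are dominated by input-token cost, which the post doesn't break down): $1,000-$1,500 at $5/M implies roughly 200-300M input tokens per model, and the same volume at the $0.25/M cutoff would land around $50-$75.

```typescript
// Back-of-envelope only: assumes the quoted per-model cost is dominated by
// input tokens and ignores output-token pricing, which the post doesn't break out.
const runCostUsd = [1000, 1500];  // reported per-model cost at frontier pricing
const frontierUsdPerMTok = 5;     // $5 per million input tokens
const cheapUsdPerMTok = 0.25;     // next round's planned price ceiling

for (const cost of runCostUsd) {
  const millionTokens = cost / frontierUsdPerMTok;      // ~200-300M input tokens
  const cheapCost = millionTokens * cheapUsdPerMTok;    // same volume at $0.25/M
  console.log(`${millionTokens}M tokens -> ~$${cheapCost} at the cheap tier`);
}
```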

The honest caveat from the author: this uses four reference projects and may favor models that comply well with procedural function-calling instructions. How well these results generalize beyond well-structured benchmark fixtures is still an open question.

Does your experience with structured function-calling in production tasks align with benchmark findings like these?

submitted by /u/jimmytoan