Testing Local LLMs in Practice: Code Generation, Quality vs. Speed

Reddit r/LocalLLaMA / 5/9/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post describes building an AI agent that autonomously generates Go code using local LLMs, with the main target being log parser generation for SIEM pipelines.
  • A major focus of the work was creating an objective evaluation approach for autonomous coding usefulness, rather than relying on subjective impressions.
  • The author developed a benchmarking harness that generates real Go parsers, compiles them, validates extracted fields and types, and compares parsing quality against expected schemas.
  • The harness also tracks throughput and speed over longer runs to study quality-versus-performance tradeoffs.
  • The author published a public first version of the benchmark and methodology and asks for feedback and suggestions on which model to test next.

Hello,

I spent the last few months building an AI agent that autonomously writes Go code using local LLMs. The primary use case is log parser generation for SIEM pipelines.
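To give a flavor of the target output, here is a simplified example of the kind of parser involved: a small Go program that extracts structured fields from an sshd auth log line. The regex and field names are illustrative, not taken from the actual benchmark.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// Illustrative pattern: extract user, source IP, and source port
// from a "Failed password" sshd log line.
var sshdFailed = regexp.MustCompile(
	`Failed password for (?P<user>\S+) from (?P<src_ip>\S+) port (?P<src_port>\d+)`)

// Parse returns the extracted fields, or false if the line doesn't match.
func Parse(line string) (map[string]string, bool) {
	m := sshdFailed.FindStringSubmatch(line)
	if m == nil {
		return nil, false
	}
	fields := map[string]string{}
	for i, name := range sshdFailed.SubexpNames() {
		if name != "" {
			fields[name] = m[i]
		}
	}
	return fields, true
}

func main() {
	fields, ok := Parse("Failed password for root from 203.0.113.7 port 52144 ssh2")
	if ok {
		out, _ := json.Marshal(fields)
		fmt.Println(string(out)) // {"src_ip":"203.0.113.7","src_port":"52144","user":"root"}
	}
}
```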

A large part of the work ended up being evaluation itself: how do you objectively measure whether a model is actually useful for autonomous coding tasks?

So I built a harness that (1) lets agents generate real Go parsers, (2) compiles the generated Go code, (3) validates extracted fields and types, (4) measures parsing quality against expected schemas, and (5) tracks throughput/speed over longer runs.
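To make that pipeline concrete, here is a minimal sketch of steps (2) and (3), assuming the generated parser is a Go module that emits one JSON object per log line. Function names like `compileParser` and `validateFields` are illustrative, not the harness's real API.

```go
package main

import (
	"fmt"
	"os/exec"
)

// compileParser runs `go build` on the generated code in dir and
// reports whether it compiles; a failed build counts against the model.
func compileParser(dir string) error {
	cmd := exec.Command("go", "build", "./...")
	cmd.Dir = dir
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("build failed: %v\n%s", err, out)
	}
	return nil
}

// validateFields checks one parsed event against the expected schema:
// every expected field must be present with the expected dynamic type.
// (Numbers decoded from JSON arrive as float64 in Go.)
func validateFields(got map[string]any, schema map[string]string) []string {
	var errs []string
	for field, wantType := range schema {
		v, ok := got[field]
		if !ok {
			errs = append(errs, fmt.Sprintf("missing field %q", field))
			continue
		}
		if gotType := fmt.Sprintf("%T", v); gotType != wantType {
			errs = append(errs, fmt.Sprintf("field %q: got %s, want %s", field, gotType, wantType))
		}
	}
	return errs
}

func main() {
	// Example: validate one parsed event; src_ip is missing on purpose.
	event := map[string]any{"user": "root", "src_port": 52144.0}
	schema := map[string]string{"user": "string", "src_port": "float64", "src_ip": "string"}
	for _, e := range validateFields(event, schema) {
		fmt.Println(e)
	}
}
```

Parsing quality (step 4) is then just the ratio of events that validate cleanly, and throughput (step 5) comes from timing the same loop over longer runs.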

Given the current release cadence of open-weight models, the results are interesting.

I published the first public version of the benchmark and methodology here:
https://ndocs.teskalabs.com/logman.io/blog/2026/04/14/testing-local-llms-in-practice-code-generation-quality-vs-speed/

Feedback is very welcome.
Also: which model should I test next?

submitted by /u/Icy_Programmer7186