So... I was looking for the best local models to use in agentic coding workflows, and that's how this benchmark idea was born. Even though it's very "me-specific", I think it might be useful for others as well, so I decided to document and publish it. The full benchmark results, methodology, visualisations etc. can be found here: https://github.com/tabupl/AdamBench

The README (plus the prompt files in review_outputs) should provide all the info needed to replicate exactly the same benchmark flow if you want to compare results or test other models against the ones I tested. I'm also totally open to:
- recommendations of models I could include that haven't been tested yet,
- recommendations regarding the methodology (check out the final parts of the README, where I mention what I want to improve in v2 of AdamBench),
- tips on how to easily make use of models that failed instantly because of tool-calling or chat-template issues (looking at you, Mistral Small 4). These were not included in the benchmark results at all, because I deemed them useless for local agentic coding due to the problems they caused :P

What is it? AdamBench is supposed to measure the usability of models in a simple, local agentic-coding workflow. The metric synthesizes the quality score of a model's solution with the number of iterations AND the time it took the model to solve the benchmark.

- TOP 10 (including a couple of models I benchmarked over API for comparison with the local ones)
- TOP 10 (just local models by AdamBench score)
- Score vs. AdamBench for selected local models

So I really recommend checking out my repo with the benchmark. The README includes all measured metrics and some additional visualisations, as well as my takeaways and ideas for what can be improved in AdamBench v2: https://github.com/tabupl/AdamBench

The key insights:
And additionally, my personal choices:
- TOP 1 daily driver for me: Qwen3.5 35b A3b (nice speed, good quality, and its size leaves more room for longer context if needed)
- For more complex tasks: Qwen3.5 122b A10b, definitely; gpt-oss-120b is something to consider too because it's much faster (due to TPS and better token management)
- For simple tasks/fast iterations: I wanted to put Qwen3.5 9b or OmniCoder 9b here, but after thinking about it I believe gpt-oss-20b is the best choice for me. It's incredibly fast (170 tps generation, sic!), has superb token management, and just performs well.

So if I had to keep just three models for myself from all the local ones I tested, it would be:
And on another note, I never want to touch Nemotron again; it's crazy inefficient (looking at you, Nemotron 3 Nano, with a whopping 300k output tokens, mostly reasoning, without being able to fix Snake). If you need more info, want to check the actual results (included), the detailed methodology, or how projects were reviewed by each reviewer (all review files are included as well), check out the repo. [link] [comments]
AdamBench - a benchmark for local LLMs for agentic coding (on RTX5080 16Gb + 64Gb RAM)
Reddit r/LocalLLaMA / 3/27/2026
💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research
Key Points
- AdamBench is a benchmark designed to measure how usable local LLMs are in a simple agentic coding workflow by combining solution quality, number of iterations, and time-to-solve into a single score.
- The author publishes full methodology, visualizations, and replicate-ready benchmark materials (prompt files and workflow) in the AdamBench GitHub repository so others can test and compare models under the same conditions.
- Results include a “Top 10” ranking for models tested locally, plus additional API-benchmarked models for comparison against local performance.
- The benchmark explicitly excludes some models that fail immediately due to tool-calling/chat-template issues, and the author invites recommendations for both new models to add and methodology improvements for a v2 iteration.
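The post doesn't publish the exact AdamBench formula, but the first key point describes combining solution quality, iteration count, and time-to-solve into one score. A minimal sketch of one way such a composite metric could work (the weights, normalization, and the `composite_score` function are all illustrative assumptions, not the actual AdamBench implementation):

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    quality: float    # reviewer score, assumed normalized to 0..1
    iterations: int   # agentic-loop iterations needed to reach the solution
    seconds: float    # total wall-clock time for the run

def composite_score(r: RunResult,
                    iter_weight: float = 0.5,
                    time_weight: float = 0.001) -> float:
    """Higher is better: quality discounted by extra iterations and elapsed time.

    Purely hypothetical weighting, not the published AdamBench formula.
    """
    penalty = 1.0 + iter_weight * (r.iterations - 1) + time_weight * r.seconds
    return r.quality / penalty

# Two runs with equal solution quality: the faster, fewer-iteration run scores higher.
fast = RunResult(quality=0.9, iterations=2, seconds=120)
slow = RunResult(quality=0.9, iterations=6, seconds=900)
print(composite_score(fast) > composite_score(slow))  # True
```

This kind of penalty-based blend captures the trade-off the author describes: a slow model that eventually produces a good solution is ranked below a fast model of equal quality.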