AdamBench - a benchmark for local LLMs for agentic coding (on RTX 5080 16GB + 64GB RAM)

Reddit r/LocalLLaMA / 3/27/2026

Key Points

  • AdamBench is a benchmark designed to measure how usable local LLMs are in a simple agentic coding workflow by combining solution quality, number of iterations, and time-to-solve into a single score.
  • The author publishes full methodology, visualizations, and replicate-ready benchmark materials (prompt files and workflow) in the AdamBench GitHub repository so others can test and compare models under the same conditions.
  • Results include a “Top 10” ranking for models tested locally, plus additional API-benchmarked models for comparison against local performance.
  • The benchmark explicitly excludes some models that fail immediately due to tool-calling/chat-template issues, and the author invites recommendations for both new models to add and methodology improvements for a v2 iteration.

So... I was looking for the best local models to use in agentic coding workflows, and that's how this benchmark idea was born. Even though it's very "me-specific", I think it might be useful for others as well, so I decided to document and publish it.

The full benchmark results, methodology, visualisations etc. can be found here: https://github.com/tabupl/AdamBench

The README (+ prompt files in review_outputs) should provide all the info you need to replicate the exact same benchmark flow, whether you want to compare results or test other models against the ones I tested.

I'm also totally open to recommendations: models I haven't tested yet that I could include, improvements to the methodology (check out the final parts of the README, where I mention what I want to improve in v2 of AdamBench), or tips on how to easily make use of models that failed instantly because of tool-calling or chat-template issues (looking at you, Mistral Small 4). Those models were not included in the benchmark results at all, because the problems they generated made them useless for local agentic coding :P
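If you want to check whether a model even survives its first tool call before running the full benchmark, here's a minimal probe sketch. It assumes a local OpenAI-compatible server (llama.cpp, LM Studio, Ollama etc.); the endpoint URL, model id, and the read_file tool are placeholders of mine, not part of AdamBench:

```python
# Minimal tool-calling smoke test against a local OpenAI-compatible endpoint.
# base_url, model id, and the read_file tool are placeholders -- adjust for your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

try:
    resp = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": "Read main.py and summarize it."}],
        tools=tools,
    )
    # None or malformed tool calls here is exactly the kind of instant failure
    # that got models excluded from the benchmark.
    print("tool calls:", resp.choices[0].message.tool_calls)
except Exception as e:
    print("chat-template / tool-calling failure:", e)
```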

What is it?

AdamBench is meant to measure the usability of models in a simple, local agentic-coding workflow. The metric synthesizes the quality score of a model's solution with the number of iterations AND the time it took the model to solve the benchmark.
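To make that concrete, here's a minimal sketch of how quality, iterations, and time could be folded into a single score. The weights and normalization below are illustrative assumptions of mine, not the actual AdamBench formula (that's documented in the repo):

```python
# Hypothetical composite score: blend solution quality with iteration and time
# penalties. The weights (0.6/0.2/0.2) and caps are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Run:
    quality: float       # reviewer quality score, normalized to 0..1
    iterations: int      # agentic iterations needed to reach a working solution
    solve_time_s: float  # wall-clock time to solve, in seconds

def composite_score(run: Run, max_iters: int = 10, max_time_s: float = 3600.0) -> float:
    iter_eff = max(0.0, 1.0 - (run.iterations - 1) / max_iters)
    time_eff = max(0.0, 1.0 - run.solve_time_s / max_time_s)
    return 0.6 * run.quality + 0.2 * iter_eff + 0.2 * time_eff

# A model scoring 0.85 on quality in 3 iterations over 15 minutes:
print(composite_score(Run(quality=0.85, iterations=3, solve_time_s=900)))  # 0.82
```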

TOP 10 (including a couple of models I benchmarked over API for comparison with the local ones)

https://preview.redd.it/wpvl750c5grg1.png?width=2830&format=png&auto=webp&s=568f15ce4db558c4548fba351ae8538006a364b6

TOP 10 (just local models by AdamBench score)

https://preview.redd.it/b6nhzfgf5grg1.png?width=3179&format=png&auto=webp&s=24b46450a3c6d9fd2c4ea60572290dc38d52e9f0

Score vs AdamBench for selected local models

https://preview.redd.it/yrhzdwvj5grg1.png?width=2779&format=png&auto=webp&s=d3ba86d0b4707dacc701f739e8ee314660be80ea

So I really recommend checking out the repo with the benchmark. The README includes all measured metrics and some additional visualisations, as well as my takeaways and ideas for what can be improved in AdamBench v2.

https://github.com/tabupl/AdamBench

The key insights:

  • The TOP 1 winner on the main benchmark metric (AdamBench) is Qwen3.5 122b A10b.
  • If you're looking for a smaller model, the TOP 3 spot among all tested local models went to Qwen3.5 35b A3b.
  • And if 35b is still too big, Qwen3.5 9b scored an astonishing TOP 7, outperforming many far bigger models.
  • The biggest positive surprise for me was the performance of gpt-oss-120b (TOP 2) and gpt-oss-20b (TOP 5). Both scored well, but most importantly they are super fast for their sizes while wasting far fewer tokens than other models on the same task.
  • The biggest disappointment was the Nemotron family, which performed quite badly quality-wise, was slow, and generated an unreasonable amount of tokens (mostly reasoning). Nemotron 3 Super, the highest-rated model from this family, only made the TOP 10 spot, outperformed even on bare quality metrics by much smaller models.

And additionally, my personal choices:

TOP 1 daily driver for me: Qwen3.5 35b A3b (nice speed, good quality, and thanks to its size it leaves more room for longer context if needed)

For more complex tasks: definitely Qwen3.5 122b A10b, and gpt-oss-120b is worth considering too because it's much faster (higher TPS and better token management)

For simple tasks/fast iterations: I wanted to put Qwen3.5 9b or OmniCoder 9b here, but after thinking about it I believe gpt-oss-20b is the best choice for me. It's incredibly fast (170 tps generation, sic!), has superb token management and just performs well.
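If you want to sanity-check TPS figures like that on your own hardware, a rough measurement against a local OpenAI-compatible endpoint can look like the sketch below (endpoint and model id are placeholders; the timing includes prompt processing, so it slightly understates pure generation speed):

```python
# Rough end-to-end tokens-per-second measurement for a local model.
# base_url and model id are placeholders -- point them at your own server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="gpt-oss-20b",  # assumed local model id, replace with yours
    messages=[{"role": "user", "content": "Write a Python snake game."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

out_tokens = resp.usage.completion_tokens
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.0f} tps")
```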

So if I had to leave just three models for myself from all the local ones I tested, it would be:

  • Qwen3.5 35b A3b
  • Qwen3.5 122b A10b
  • gpt-oss-20b

And on another note, I never want to touch Nemotron again; it's crazy inefficient (looking at you, Nemotron 3 Nano, with a whopping 300k output tokens, mostly reasoning, and still unable to fix Snake).

If you need more info, want to check the actual results (included), dig into the detailed methodology, or are curious how projects were reviewed by each reviewer (all review files are included as well) -> check out the repo.

submitted by /u/Real_Ebb_7417