I'm building a benchmark comparing models for an agentic task. Are there any small models I should be testing that I haven't?

Reddit r/LocalLLaMA / 3/26/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author is building a constrained agentic benchmark that requires multiple LLM calls with feedback loops.
  • They are asking for recommendations of small models—especially under 10B parameters—that can perform reliable tool calling.
  • The post shares a current shortlist/plan (via an image link) of models they are already considering for comparison.
  • The goal is to gather community suggestions for additional small models worth testing in the same evaluation setup.
I'm working on a constrained agentic benchmark task: it requires multiple LLM calls with feedback.

Are there any good, small models I should try (or that people are interested in comparing)? I'm especially interested in anything in the sub-10B range that can do reliable tool calling.
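For context, the "multiple LLM calls with feedback" structure can be sketched as a simple agent loop: call the model, execute any tool it requests, feed the result back, and repeat until it produces a final answer. Everything below is illustrative, not the author's actual harness; `call_model`, the `add` tool, and the message format are stand-ins for whatever local model API and task-specific tools the benchmark uses.

```python
# Hypothetical sketch of an agentic benchmark episode with tool calling.
# `call_model` is a stub standing in for a real model endpoint
# (e.g. an OpenAI-compatible server from llama.cpp, vLLM, or Ollama).

TOOLS = {
    # Toy tool; a real benchmark would expose task-specific tools.
    "add": lambda a, b: a + b,
}

def call_model(messages):
    # Stub model: requests one tool call, then answers with the tool result.
    last = messages[-1]
    if last["role"] == "user":
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"answer": str(last["content"])}

def run_episode(task, max_turns=5):
    """Agent loop: call the model, run any requested tool,
    feed the result back, stop on a final answer or turn limit."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        out = call_model(messages)
        if "answer" in out:
            return out["answer"], len(messages)
        result = TOOLS[out["tool"]](**out["args"])
        messages.append({"role": "tool", "content": result})
    return None, len(messages)

answer, turns = run_episode("What is 2 + 3?")
print(answer)  # "5" with the stub model
```

Swapping the stub for a real sub-10B model's tool-calling API is where small models tend to diverge: the loop only works if the model reliably emits well-formed tool calls every turn.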

Here's what I have so far:

https://preview.redd.it/y950e4ri3erg1.png?width=2428&format=png&auto=webp&s=4c4e4000290b56e5955d8d5dc5c53e195409e866

submitted by /u/nickl