AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?
arXiv cs.AI / 5/4/2026
Key Points
- The paper introduces AgentFloor, a deterministic 30-task benchmark that grades agent capabilities on a six-tier ladder from instruction following to long-horizon planning under persistent constraints.
- The authors evaluate 16 open-weight models (0.27B–32B parameters) and also include GPT-5, running 16,542 scored trials to test how far “small” models can go in real agent workflows.
- Results indicate a practical capability boundary: smaller and mid-sized open-weight models are already strong enough for the short-horizon, structured tool-use work that dominates many agent pipelines.
- The strongest open-weight model overall matches GPT-5 on the benchmark while being substantially cheaper and faster, but frontier models still lead most clearly on long-horizon tasks requiring sustained coordination and reliable constraint tracking.
- The study also finds the gap is not explained by scale alone; model-specific failures are sometimes improved by targeted interventions. The authors recommend routing routine actions to smaller open-weight models and reserving frontier models for the narrower set of tasks that need deeper planning and control.
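The routing recommendation above can be sketched as a simple policy that dispatches each task by estimated difficulty. This is an illustrative sketch only: the model names, tier cutoff, and `Task` fields below are hypothetical assumptions, not details from the paper.

```python
# Hedged sketch of the recommended routing policy: short-horizon,
# structured tool-use tasks go to a small open-weight model; tasks
# needing long-horizon planning or persistent constraint tracking
# go to a frontier model. All identifiers here are illustrative.
from dataclasses import dataclass

SMALL_MODEL = "open-weight-32b"   # placeholder model ID
FRONTIER_MODEL = "frontier-model" # placeholder model ID

@dataclass
class Task:
    description: str
    horizon_steps: int               # estimated number of tool calls
    needs_constraint_tracking: bool  # must track constraints across steps

def route(task: Task, long_horizon_cutoff: int = 10) -> str:
    """Pick a model: long or constraint-heavy tasks go to the frontier model."""
    if task.horizon_steps > long_horizon_cutoff or task.needs_constraint_tracking:
        return FRONTIER_MODEL
    return SMALL_MODEL

# Routine structured tool use stays on the cheaper model.
print(route(Task("call a weather API and format the result", 2, False)))
print(route(Task("multi-day migration under persistent budget caps", 40, True)))
```

The cutoff would in practice be tuned against a benchmark like AgentFloor's tier ladder rather than hard-coded.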