Agentic Frameworks for Reasoning Tasks: An Empirical Study
arXiv cs.AI / 4/21/2026
Key Points
- The study empirically compares 22 popular agentic frameworks on three reasoning benchmarks (BBH, GSM8K, and ARC) under a unified evaluation setup, measuring accuracy, execution time, computational cost, and cross-benchmark consistency (a minimal harness of this shape is sketched after this list).
- Nineteen of the 22 frameworks completed all three benchmarks, and 12 of them delivered stable performance: mean accuracy in the 74.6–75.9% range, running times of 4–6 seconds per task, and costs of roughly 0.14–0.18 cents per task.
- The main drivers of weaker performance were orchestration issues rather than inherent reasoning limitations, including uncontrolled context/memory growth (e.g., Camel), costly retry loops from extraction failures (e.g., Upsonic), and API quota exhaustion from iterative interactions that increased prompt length (e.g., AutoGen, Mastra).
- Mathematical reasoning performance was notably lower: GSM8K mean accuracy was 44.35%, versus about 89.8% on BBH and 89.56% on ARC, indicating benchmark-dependent difficulty.
- The authors conclude that selecting an agentic framework for reasoning-heavy software engineering should prioritize orchestration quality, especially memory control, failure handling, and cost management; the second sketch below shows minimal guards for each.
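
The paper does not reproduce its evaluation harness here, so the following is only a rough illustration of what a unified per-task loop measuring the three reported metrics (accuracy, latency, cost) can look like. The `run_agent` callable, the token-price constants, and the `(question, answer)` dataset format are assumptions for the sketch, not the authors' setup.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Hypothetical per-1K-token prices; real values depend on the model/provider.
PRICE_PER_1K_PROMPT = 0.0005   # USD
PRICE_PER_1K_OUTPUT = 0.0015   # USD

@dataclass
class TaskResult:
    correct: bool
    seconds: float
    cost_usd: float

def evaluate(run_agent: Callable[[str], tuple[str, int, int]],
             tasks: list[tuple[str, str]]) -> dict:
    """Run every (question, gold_answer) pair through one framework and
    aggregate accuracy, mean latency, and mean cost per task."""
    results: list[TaskResult] = []
    for question, gold in tasks:
        start = time.perf_counter()
        # run_agent is a stand-in for "invoke framework X on one task";
        # it returns (answer_text, prompt_tokens, output_tokens).
        answer, p_tok, o_tok = run_agent(question)
        elapsed = time.perf_counter() - start
        cost = (p_tok / 1000) * PRICE_PER_1K_PROMPT + (o_tok / 1000) * PRICE_PER_1K_OUTPUT
        results.append(TaskResult(answer.strip() == gold.strip(), elapsed, cost))
    n = len(results)
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "mean_seconds": sum(r.seconds for r in results) / n,
        "mean_cost_usd": sum(r.cost_usd for r in results) / n,
    }
```

Running the same loop over every framework and benchmark is what makes the per-task accuracy, time, and cost figures directly comparable.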
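
The three failure modes in the list (unbounded context growth, runaway retry loops, quota exhaustion) all have cheap defensive counterparts. The wrapper below is a hedged sketch of such guards; `llm_call` and `extract_answer` are hypothetical stand-ins, and nothing here mirrors any specific framework's API.

```python
from collections import deque

class BudgetExceeded(RuntimeError):
    """Raised when a single task exceeds its cost ceiling."""

class GuardedAgent:
    """Illustrative orchestration guards: bounded memory, capped retries,
    and a per-task cost budget."""

    def __init__(self, llm_call, extract_answer,
                 max_turns_kept=8, max_retries=2, budget_usd=0.01):
        self.llm_call = llm_call              # (messages) -> (text, cost_usd)
        self.extract_answer = extract_answer  # (text) -> str or None
        # deque(maxlen=...) silently drops the oldest turns, so the
        # context can never grow without bound.
        self.memory = deque(maxlen=max_turns_kept)
        self.max_retries = max_retries
        self.budget_usd = budget_usd

    def run(self, question: str) -> str | None:
        spent = 0.0
        self.memory.append({"role": "user", "content": question})
        for _ in range(self.max_retries + 1):
            text, cost = self.llm_call(list(self.memory))
            spent += cost
            if spent > self.budget_usd:
                # Abort instead of exhausting the API quota on one task.
                raise BudgetExceeded(f"spent ${spent:.4f} on a single task")
            self.memory.append({"role": "assistant", "content": text})
            answer = self.extract_answer(text)
            if answer is not None:
                return answer
            # Extraction failed: retry a bounded number of times,
            # never loop indefinitely.
            self.memory.append({"role": "user",
                                "content": "Answer with just the final result."})
        return None  # give up cleanly after max_retries
```

Each guard maps to one observed failure: the bounded `memory` addresses context growth (Camel), the capped retry loop addresses extraction-failure loops (Upsonic), and the budget check addresses quota exhaustion from ever-longer prompts (AutoGen, Mastra).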