FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments

arXiv cs.CL / 4/29/2026


Key Points

  • The paper introduces FAMA, a Failure-Aware Meta-Agentic framework for improving open-source LLM agents on interactive, multi-turn tool-use tasks in conversational benchmarks that simulate customer-centric issue resolution.
  • It analyzes failure trajectories from baseline agents to identify the most prevalent error patterns that cause cascading breakdowns in multi-turn decision making.
  • FAMA then uses orchestration to activate only a minimal subset of specialized agents that inject targeted context into the tool-use agent before the next decision step.
  • Experiments on multiple open-source LLMs show up to 27% performance gains over standard baselines across evaluation modes.
  • The work suggests that selectively curating and injecting context to address common failures is an effective design principle for building more reliable multi-turn tool-use agents.
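The two-stage loop described above can be sketched in a few lines. Everything here is illustrative: the function names, the trajectory format, and the specialist registry are assumptions for the sake of the sketch, not the paper's actual implementation.

```python
from collections import Counter

# Hypothetical sketch of FAMA's two stages; all names and data
# shapes are illustrative assumptions, not the paper's API.

def identify_top_failures(failure_trajectories, k=3):
    """Stage 1: count error patterns across baseline failure
    trajectories and keep the k most prevalent ones."""
    counts = Counter(
        step["error_pattern"]
        for traj in failure_trajectories
        for step in traj
        if step.get("error_pattern")
    )
    return [pattern for pattern, _ in counts.most_common(k)]

def orchestrate(state, top_failures, specialists):
    """Stage 2: activate only the specialists matching the
    prevalent failure modes and collect their targeted context,
    to be injected before the tool-use agent's next decision."""
    context = []
    for failure in top_failures:
        specialist = specialists.get(failure)
        if specialist is not None:
            context.append(specialist(state))
    return "\n".join(context)

# Toy usage with two invented failure modes.
trajectories = [
    [{"error_pattern": "wrong_tool"}, {"error_pattern": "missing_arg"}],
    [{"error_pattern": "wrong_tool"}],
]
specialists = {
    "wrong_tool": lambda s: "Check the tool name against the schema.",
    "missing_arg": lambda s: "Verify required arguments before calling.",
}
top = identify_top_failures(trajectories, k=1)
injected = orchestrate({}, top, specialists)
```

The key design point the paper highlights is that only a minimal subset of specialists fires, so the injected context stays small enough for models with limited context windows.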

Abstract

Large Language Models are being increasingly deployed as the decision-making core of autonomous agents capable of effecting change in external environments. Yet, in conversational benchmarks, which simulate real-world customer-centric issue resolution scenarios, these agents frequently fail due to the cascading effects of incorrect decision-making. These challenges are particularly pronounced for open-source LLMs with smaller parameter sizes, limited context windows, and constrained inference budgets, which contribute to increased error accumulation in agentic settings. To tackle these challenges, we present the Failure-Aware Meta-Agentic (FAMA) framework. FAMA operates in two stages: first, it analyzes failure trajectories from baseline agents to identify the most prevalent errors; second, it employs an orchestration mechanism that activates a minimal subset of specialized agents tailored to address these failures by injecting a targeted context for the tool-use agent before the decision-making step. Experiments across open-source LLMs demonstrate performance gains up to 27% across evaluation modes over standard baselines. These results highlight that targeted curation of context through specialized agents to address common failures is a valuable design principle for building reliable, multi-turn tool-use LLM agents that simulate real-world conversational scenarios.