AgentFixer: From Failure Detection to Fix Recommendations in LLM Agentic Systems

arXiv cs.AI / 4/1/2026


Key Points

  • AgentFixer is introduced as a validation framework for LLM-based agentic systems, combining fifteen failure-detection tools with two root-cause analysis modules to diagnose reliability failures systematically.
  • The framework targets weaknesses across input handling, prompt design, and output generation, using a mix of lightweight rule checks and LLM-as-a-judge assessments for incident detection, classification, and repair.
  • Applied to IBM CUGA and evaluated on AppWorld and WebArena, the approach identified recurrent issues such as planner misalignments, schema violations, and brittle prompt dependencies.
  • Using the diagnostics, the authors refined prompting and coding strategies, preserving CUGA's benchmark results while enabling mid-sized models (e.g., Llama 4 and Mistral Medium) to substantially narrow the accuracy gap with frontier models.
  • The work also explores an agentic validation loop where diagnostic outputs are fed into an LLM for self-reflection and prioritization, moving validation toward a dialogue-driven, self-improving process for production use.
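The detection pipeline described above (cheap rule checks first, with an LLM-as-a-judge handling cases rules cannot decide) can be sketched as follows. All names and interfaces here are illustrative assumptions; the paper's actual tool APIs are not given in this summary, and the judge is stubbed rather than calling a real model.

```python
# Hypothetical sketch: rule-based checks plus an LLM-judge callback.
# Names (Incident, schema_check, validate) are invented for illustration.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Incident:
    tool: str        # which detector fired
    category: str    # e.g. "schema_violation", "planner_misalignment"
    detail: str

def schema_check(output: dict, required_keys: set) -> Optional[Incident]:
    """Lightweight rule check: flag outputs missing required schema keys."""
    missing = required_keys - output.keys()
    if missing:
        return Incident("schema_check", "schema_violation",
                        f"missing keys: {sorted(missing)}")
    return None

def validate(output: dict, required_keys: set,
             judge: Callable[[dict], Optional[Incident]]) -> list:
    """Run cheap rule checks first, then defer to the LLM-as-a-judge."""
    incidents = []
    if (inc := schema_check(output, required_keys)):
        incidents.append(inc)
    if (inc := judge(output)):  # e.g. "does the plan match the user goal?"
        incidents.append(inc)
    return incidents

def toy_judge(output: dict) -> Optional[Incident]:
    """Stub standing in for an LLM call, for demonstration only."""
    if output.get("plan") == "unrelated":
        return Incident("llm_judge", "planner_misalignment",
                        "plan ignores the stated goal")
    return None
```

For example, `validate({"plan": "unrelated"}, {"plan", "result"}, toy_judge)` yields both a schema violation (missing `result`) and a planner misalignment, illustrating how rule checks and judge assessments compose into one incident list.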

Abstract

We introduce a comprehensive validation framework for LLM-based agentic systems that provides systematic diagnosis and improvement of reliability failures. The framework includes fifteen failure-detection tools and two root-cause analysis modules that jointly uncover weaknesses across input handling, prompt design, and output generation. It integrates lightweight rule-based checks with LLM-as-a-judge assessments to support structured incident detection, classification, and repair. We applied the framework to IBM CUGA, evaluating its performance on the AppWorld and WebArena benchmarks. The analysis revealed recurrent planner misalignments, schema violations, brittle prompt dependencies, and more. Based on these insights, we refined both prompting and coding strategies, maintaining CUGA's benchmark results while enabling mid-sized models such as Llama 4 and Mistral Medium to achieve notable accuracy gains, substantially narrowing the gap with frontier models. Beyond quantitative validation, we conducted an exploratory study that fed the framework's diagnostic outputs and agent description into an LLM for self-reflection and prioritization. This interactive analysis produced actionable insights on recurring failure patterns and focus areas for improvement, demonstrating how validation itself can evolve into an agentic, dialogue-driven process. These results show a path toward scalable quality assurance and adaptive validation in production agentic systems, offering a foundation for more robust, interpretable, and self-improving agentic architectures.
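The exploratory self-reflection study in the abstract can be sketched as one turn of a loop: aggregate incident categories, combine them with the agent description into a prompt, and hand that to an LLM-style callback that returns prioritized focus areas. The prompt wording and the `reflection_loop` interface below are invented for illustration and are not taken from the paper; the LLM is a plain callable so the sketch runs without any model.

```python
# Hypothetical sketch of the dialogue-driven validation loop:
# diagnostics in, prioritized improvement areas out.
from collections import Counter
from typing import Callable

def build_reflection_prompt(agent_description: str,
                            incident_categories: list) -> str:
    """Summarize incident counts and frame a prioritization request."""
    counts = Counter(incident_categories)
    summary = "\n".join(f"- {cat}: {n} occurrences"
                        for cat, n in counts.most_common())
    return (f"Agent under test:\n{agent_description}\n\n"
            f"Observed failure categories:\n{summary}\n\n"
            "Rank the categories by expected impact of fixing them.")

def reflection_loop(agent_description: str, incident_categories: list,
                    llm: Callable[[str], str]) -> str:
    """One turn of the loop: feed diagnostics to the LLM, get priorities."""
    return llm(build_reflection_prompt(agent_description, incident_categories))
```

In a real deployment the `llm` callable would wrap a chat-model API and its reply would feed the next round of targeted fixes; here any `str -> str` function suffices for testing the plumbing.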