FGDM: Reasoning Aware Multi-Agentic Framework for Software Bug Detection using Chain of Thought and Tree of Thought Prompting

arXiv cs.LG / 4/29/2026


Key Points

  • The paper argues that existing deep-learning approaches for software bug detection struggle with the “global” context of code, leading to performance drops on large, interconnected, or modular codebases.
  • It proposes FGDM, a Flow-Graph-Driven Multi-Agent framework that converts code into a flow graph, detects erroneous segments, and generates repaired code using four sequential agents.
  • FGDM relies on Chain-of-Thought and Tree-of-Thought prompting across its agents to improve reasoning over dependencies among modules.
  • The system integrates FAISS to retrieve similar past bugs and their repairs, using retrieved examples to support the repair process.
  • Experiments on more than 100 programs drawn from popular projects (e.g., Pandas, FastAPI, Matplotlib, Scrapy) show FGDM outperforming prior methods, with mean reductions in Levenshtein distance of 24.33 (Python) and 8.37 (C) and cosine similarity scores of 0.951 (Python) and 0.974 (C).
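
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how they can be computed for a pair of code strings. It is not the paper's evaluation code. The summary does not say how code is vectorized for cosine similarity, so token counts are used here as an assumption, and the function names and the reading of "reduction" as distance before repair minus distance after repair are illustrative only.

```python
import math
import re
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # drop ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over token-count vectors (an assumed vectorization)."""
    u = Counter(re.findall(r"\w+|[^\w\s]", a))
    v = Counter(re.findall(r"\w+|[^\w\s]", b))
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical example: a one-character bug and a candidate repair.
buggy = "return a - b"
reference = "return a + b"
candidate = "return a + b"

# Assumed reading of "reduction": how much closer the repair is to the reference.
reduction = levenshtein(buggy, reference) - levenshtein(candidate, reference)
similarity = cosine_similarity(candidate, reference)
```

A lower Levenshtein distance and a higher cosine similarity both indicate that a repair is closer to the reference fix.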

Abstract

Deep learning methods are becoming prominent in automated software bug detection; however, they lack a global understanding of the code under analysis. Consequently, their performance tends to degrade when they are applied to large, interconnected codebases or complex modular programs. Recently, Large Language Models (LLMs) have proven effective at capturing dependencies among multiple interconnected modules in a codebase. This motivated us to propose the Flow-Graph-Driven Multi-Agent Framework (FGDM), which is composed of four agents that operate sequentially. The framework converts the received code into a flow graph, identifies the erroneous segments, and then generates the repaired code. All agents use Chain-of-Thought (CoT) and Tree-of-Thoughts (ToT) prompts. Additionally, we integrate the FAISS vector database to retrieve similar previous bugs and their repairs. We demonstrate the efficacy of the proposed framework on over 100 Python and C programs from several projects, including Ansible, Black, FastAPI, Keras, Luigi, Matplotlib, Pandas, Scrapy, SpaCy, and Tornado. Our experiments show that FGDM outperforms existing approaches, yielding mean reductions in Levenshtein distance of 24.33 and 8.37 and cosine similarities of 0.951 and 0.974 for Python and C, respectively.
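
The idea of retrieving similar previous bugs and feeding them to a repair prompt can be illustrated with a small sketch. It deliberately does not use FAISS: a brute-force cosine search over token-count vectors stands in for the FAISS index. `BugRecord`, `embed`, `retrieve`, `build_repair_prompt`, and the prompt wording are all hypothetical names chosen for illustration, not the paper's implementation.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class BugRecord:
    buggy: str  # code as it was before the fix
    fixed: str  # code after the fix


def embed(code: str) -> Counter:
    # Hypothetical stand-in for a learned code embedding: identifier/number counts.
    return Counter(re.findall(r"\w+", code))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, store: list[BugRecord], k: int = 2) -> list[BugRecord]:
    # Brute-force nearest-neighbour search; a vector index would replace this scan.
    q = embed(query)
    ranked = sorted(store, key=lambda r: cosine(q, embed(r.buggy)), reverse=True)
    return ranked[:k]


def build_repair_prompt(query: str, examples: list[BugRecord]) -> str:
    # Assemble retrieved bug/fix pairs as few-shot context for a repair agent.
    parts = ["Similar past bugs and their repairs:"]
    for n, ex in enumerate(examples, start=1):
        parts.append(f"Example {n} (buggy):\n{ex.buggy}")
        parts.append(f"Example {n} (fixed):\n{ex.fixed}")
    parts.append(f"Code to repair:\n{query}")
    parts.append("Reason step by step about the fault, then output the repaired code.")
    return "\n\n".join(parts)


# Hypothetical store of past bugs and a new buggy snippet.
store = [
    BugRecord("for i in range(len(xs) + 1): total += xs[i]",
              "for i in range(len(xs)): total += xs[i]"),
    BugRecord("if user = None: return", "if user is None: return"),
    BugRecord("open(path).read()", "with open(path) as f: f.read()"),
]
query = "for j in range(len(items) + 1): print(items[j])"
prompt = build_repair_prompt(query, retrieve(query, store, k=1))
```

In a real system the token-count `embed` would be replaced by a learned code embedding and the sorted scan by a vector index such as FAISS; the prompt assembly step would stay the same.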