AOP-Smart: A RAG-Enhanced Large Language Model Framework for Adverse Outcome Pathway Analysis

arXiv cs.CL / 4/14/2026


Key Points

  • The paper introduces AOP-Smart, an AOP-oriented Retrieval-Augmented Generation (RAG) framework designed to improve reliability in Adverse Outcome Pathway (AOP) question answering and mechanistic reasoning.
  • AOP-Smart uses official AOP-Wiki XML data to retrieve relevant knowledge based on Key Events (KEs), Key Event Relationships (KERs), and AOP-specific information, aiming to reduce LLM hallucinations.
  • The authors evaluate the approach on 20 AOP-related QA tasks spanning KE identification and both simple and complex retrieval across upstream/downstream relationships.
  • Experiments across Gemini, DeepSeek, and ChatGPT show large accuracy gains when using RAG versus no-RAG (e.g., GPT from 15% to 95%, DeepSeek from 35% to 100%, Gemini from 20% to 95%).

Abstract

Adverse Outcome Pathways (AOPs) are an important knowledge framework in toxicological research and risk assessment. In recent years, large language models (LLMs) have increasingly been applied to AOP-related question answering and mechanistic reasoning tasks. However, their reliability remains limited by hallucination: models may generate content that is inconsistent with the facts or lacks supporting evidence. To address this issue, this study proposes an AOP-oriented Retrieval-Augmented Generation (RAG) framework, AOP-Smart. Building on the official XML data from AOP-Wiki, the method uses Key Events (KEs), Key Event Relationships (KERs), and AOP-specific information to retrieve knowledge relevant to user questions, thereby improving the reliability of LLM-generated answers. To evaluate the proposed method, the study constructed a test set of 20 AOP-related question answering tasks covering KE identification, upstream and downstream KE retrieval, and complex AOP retrieval. Experiments were conducted on three mainstream LLMs, Gemini, DeepSeek, and ChatGPT, under two settings: without RAG and with RAG. Without RAG, the accuracies of GPT, DeepSeek, and Gemini were 15.0%, 35.0%, and 20.0%, respectively; with RAG, their accuracies rose to 95.0%, 100.0%, and 95.0%. The results indicate that AOP-Smart can significantly alleviate the hallucination problem of LLMs on AOP knowledge tasks and greatly improve the accuracy and consistency of their answers.
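The paper does not publish code, but the retrieval step it describes (parsing AOP-Wiki XML into KEs and KERs, then selecting entries relevant to a question and prepending them as context) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the XML snippet uses a simplified, hypothetical stand-in schema (the real AOP-Wiki export is much richer), and `retrieve`/`build_prompt` are invented names using naive keyword matching in place of whatever retrieval AOP-Smart actually performs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for an AOP-Wiki XML export.
AOP_XML = """
<data>
  <key-event id="KE1"><title>Oxidative stress</title></key-event>
  <key-event id="KE2"><title>Mitochondrial dysfunction</title></key-event>
  <key-event-relationship id="KER1">
    <upstream>KE1</upstream><downstream>KE2</downstream>
  </key-event-relationship>
</data>
"""

def build_index(xml_text):
    """Parse the XML into a KE id->title map and a list of (upstream, downstream) KER pairs."""
    root = ET.fromstring(xml_text)
    kes = {ke.get("id"): ke.findtext("title") for ke in root.iter("key-event")}
    kers = [(r.findtext("upstream"), r.findtext("downstream"))
            for r in root.iter("key-event-relationship")]
    return kes, kers

def retrieve(question, kes, kers):
    """Naive keyword retrieval: keep KEs whose title words all appear in the
    question, plus any KER touching a retrieved KE (covers upstream/downstream lookups)."""
    q = question.lower()
    hits = {kid: title for kid, title in kes.items()
            if all(word in q for word in title.lower().split())}
    rels = [(u, d) for u, d in kers if u in hits or d in hits]
    return hits, rels

def build_prompt(question, kes, kers):
    """Prepend the retrieved KE/KER facts as grounding context for the LLM."""
    hits, rels = retrieve(question, kes, kers)
    lines = [f"{kid}: {title}" for kid, title in hits.items()]
    lines += [f"{u} -> {d}" for u, d in rels]
    context = "\n".join(lines)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

For a question such as "What happens downstream of oxidative stress?", this sketch retrieves KE1 and the KE1 -> KE2 relationship, so the model answers from retrieved AOP-Wiki facts rather than from parametric memory, which is the hallucination-reduction mechanism the paper evaluates.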