EventADL: Open-Box Anomaly Detection and Localization Framework for Events in Cloud-Based Service Systems

arXiv cs.LG / 5/5/2026


Key Points

  • The paper introduces EventADL, an open-box, event-based anomaly detection and localization framework aimed at filling a gap where prior ADL work focused mainly on metrics and logs rather than event data.
  • Using a systematic analysis of 520 real-world incidents, it characterizes how anomalies and their underlying root causes appear in event streams.
  • EventADL operates in three phases—offline training, online detection, and root-cause localization—by learning Event Semantic Patterns (normal entity interactions) and Event Frequency Patterns (normal occurrence rates), then flagging deviations during online detection.
  • To localize root causes automatically and in an interpretable way, it builds an Intervention Graph that links recent system interactions to the detected anomalies.
  • In an evaluation on three real cloud service systems and two real-world incidents, EventADL outperforms existing methods, reaching F1-scores of at least 90% for anomaly detection and 100% top-3 accuracy for root-cause localization.
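
To make the train-then-detect idea above concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes an event can be reduced to a hypothetical (source, action, target) triple, treats the set of triples seen in history as a stand-in for Event Semantic Patterns, and uses a per-window mean and standard deviation as a stand-in for Event Frequency Patterns. The function names `train` and `detect` and the threshold `k` are illustrative assumptions.

```python
from collections import Counter
from statistics import mean, pstdev

# Assumption: a historical event is (window_id, source, action, target).
# This layout is hypothetical and chosen only for illustration.

def train(events):
    """Learn toy patterns from unlabeled history.

    Returns the set of interactions seen (a stand-in for semantic patterns)
    and, per interaction, the mean and population std of its per-window count
    (a stand-in for frequency patterns).
    """
    windows = sorted({w for w, *_ in events})
    counts = Counter((w, (s, a, t)) for w, s, a, t in events)
    patterns = {(s, a, t) for _, s, a, t in events}
    freq = {}
    for p in patterns:
        per_window = [counts[(w, p)] for w in windows]
        freq[p] = (mean(per_window), pstdev(per_window))
    return patterns, freq

def detect(window_events, patterns, freq, k=3.0):
    """Flag interactions in one online window that deviate from either pattern."""
    observed = Counter(window_events)
    anomalies = []
    # Check known patterns too, so a pattern that disappears can be flagged.
    for p in sorted(set(observed) | patterns):
        if p not in patterns:
            anomalies.append((p, "unseen interaction"))
            continue
        mu, sigma = freq[p]
        # Floor sigma at 1.0 so perfectly regular history does not make
        # every one-event fluctuation look anomalous.
        if abs(observed[p] - mu) > k * max(sigma, 1.0):
            anomalies.append((p, "unusual frequency"))
    return anomalies
```

Calling `train` on historical windows and then `detect` on each new window returns a list of (interaction, reason) pairs, which keeps every flag traceable to a specific pattern. That is the sense in which this sketch mirrors the "open-box" goal; the real framework's pattern representation and deviation test may differ.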

Abstract

Anomaly detection and localization (ADL) is critical for maintaining reliability and availability in cloud systems. Recent ADL developments focus on metric and log data, leaving event data unexplored. To address this gap, we propose EventADL, the first open-box event-based ADL framework for cloud-based service systems. To motivate the design of our framework, we conduct a systematic analysis on 520 real-world incidents, and provide insights into how anomalies and their root causes manifest through event data. EventADL has three phases: offline training, online anomaly detection, and root cause localization. During the training phase, EventADL first learns Event Semantic Patterns (ESPs), which capture normal interactions between system entities using historical event data, and then learns Event Frequency Patterns (EFPs), which capture the normal frequency of known ESPs. In the online anomaly detection phase, any data in the event stream that deviates significantly from either pattern is identified as anomalous. For localization, EventADL constructs an Intervention Graph that models the relationships between recent system interactions and the detected anomalies for automatic root cause localization. The framework is designed to operate efficiently with unlabeled data and to produce interpretable anomalies with their corresponding root causes. Our evaluation on three real cloud service systems and two real-world incidents demonstrates that EventADL outperforms existing methods, achieving F1-scores of at least 90% for anomaly detection and 100% top-3 accuracy in root cause localization.
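
The abstract does not spell out how the Intervention Graph is built or scored, so the following Python sketch only illustrates the general idea of linking recent interactions to detected anomalies. It assumes each recent interaction is a hypothetical (source, action, target) triple, draws an edge from the acting entity to the entity acted on, and ranks candidates by how many anomalous entities they can reach. The scoring rule, the function names, and the `top_k` parameter are assumptions, not the paper's method.

```python
from collections import defaultdict, deque

def build_graph(recent_interactions):
    """Directed graph: edge from the acting entity to the entity it acted on."""
    graph = defaultdict(set)
    for source, _action, target in recent_interactions:
        graph[source].add(target)
    return graph

def reachable(graph, start):
    """Entities reachable from `start` by following edges (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def rank_root_causes(recent_interactions, anomalous_entities, top_k=3):
    """Rank acting entities by the number of anomalous entities they reach.

    Ties are broken alphabetically so the output is deterministic.
    """
    graph = build_graph(recent_interactions)
    anomalous = set(anomalous_entities)
    scores = {}
    for candidate in list(graph):
        hits = reachable(graph, candidate) & anomalous
        if hits:
            scores[candidate] = len(hits)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

Returning the top few candidates with their scores corresponds loosely to the top-3 accuracy metric reported in the abstract: a localization counts as correct when the true root cause appears among the highest-ranked entries.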