LLM-Enhanced Log Anomaly Detection: A Comprehensive Benchmark of Large Language Models for Automated System Diagnostics
arXiv cs.LG · April 15, 2026
Key Points
- The paper introduces a comprehensive benchmark comparing LLM-based and traditional methods for system log anomaly detection across four public datasets (HDFS, BGL, Thunderbird, Spirit).
- It evaluates three method families: classical log parsers plus ML classifiers, fine-tuned transformer models (BERT/RoBERTa), and prompt-based LLM approaches (GPT-3.5, GPT-4, LLaMA-3) in zero-shot and few-shot settings.
- Fine-tuned transformers deliver the best accuracy, reaching F1 scores of about 0.96–0.99, while prompt-based LLMs still perform strongly in zero-shot (F1 roughly 0.82–0.91) without labeled training data.
- The study analyzes practical deployment considerations, including cost-accuracy trade-offs, latency, and common failure modes across the three approaches.
- The authors release code and configurations to support reproducibility and provide practitioner-oriented guidelines for selecting methods under constraints like label scarcity, latency, and budget.
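The prompt-based zero-shot setting from the third method family can be sketched in a few lines: wrap a raw log line in a classification prompt, send it to any chat-completion model, and parse the one-word verdict. The prompt wording, function names, and label set below are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative sketch of zero-shot prompt-based log anomaly detection.
# Prompt text and labels are assumptions for demonstration, not the
# authors' released configuration.

def build_zero_shot_prompt(log_line: str) -> str:
    """Wrap a raw log line in a zero-shot anomaly-classification prompt."""
    return (
        "You are a system-log analyst. Classify the log line below as "
        "NORMAL or ANOMALY. Reply with a single word.\n\n"
        f"Log line: {log_line}\n"
        "Answer:"
    )

def parse_verdict(model_reply: str) -> bool:
    """Return True if the model's reply indicates an anomaly."""
    return "anomal" in model_reply.strip().lower()

# Usage with a hypothetical chat-completion client (any provider works):
#   reply = client.complete(build_zero_shot_prompt(line))
#   is_anomaly = parse_verdict(reply)
```

A few-shot variant would simply prepend a handful of labeled example lines to the same prompt, which is where the paper's few-shot numbers come from.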