ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response
arXiv cs.LG / 4/24/2026
Key Points
- The paper introduces ARFBench, a new benchmark for time series question answering (TSQA) focused on detecting and reasoning about anomalies in software incident data.
- ARFBench includes 750 questions spanning 142 time series and 5.38M data points from 63 production incidents, sourced exclusively from Datadog’s internal telemetry.
- Evaluations across proprietary and open-source LLMs/VLMs and time series foundation models show that frontier VLMs outperform prior baselines, with GPT-5 leading at 62.7% accuracy and 51.9% F1.
- The authors propose a specialized TSFM+VLM hybrid approach and demonstrate that a post-trained prototype can reach comparable overall performance using a smaller amount of synthetic and real data.
- Models and human experts show complementary strengths: oracle selectors defined over their answers reach 82.8% F1 and 87.2% accuracy, which the authors frame as a new “superhuman” frontier for future TSQA systems.
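
To make the oracle-selector idea in the last key point concrete, here is a minimal sketch of a best-of-two oracle scored by accuracy. It is an illustration under assumptions, not the paper's evaluation code: the function names `accuracy` and `oracle_select`, the multiple-choice answer format, and the toy answers are all hypothetical, and the paper may define its selectors and its F1 computation differently.

```python
from typing import List, Sequence


def accuracy(preds: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of questions answered correctly."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def oracle_select(a: Sequence[str], b: Sequence[str], labels: Sequence[str]) -> List[str]:
    """Hypothetical best-of-two oracle: per question, keep a's answer if it is
    correct, otherwise fall back to b's answer. The oracle peeks at the labels,
    so its score is an upper bound on any real selector, not a deployable system."""
    return [pa if pa == y else pb for pa, pb, y in zip(a, b, labels)]


# Toy data (made up for illustration): four multiple-choice questions.
labels = ["A", "B", "C", "D"]
model = ["A", "B", "D", "A"]   # correct on questions 0 and 1
expert = ["B", "B", "C", "A"]  # correct on questions 1 and 2

oracle = oracle_select(model, expert, labels)

print(accuracy(model, labels))   # 0.5
print(accuracy(expert, labels))  # 0.5
print(accuracy(oracle, labels))  # 0.75
```

In this toy case the model and the expert each score 0.5, but they fail on different questions, so the oracle reaches 0.75. That gap between individual scores and the oracle score is what "complementary strengths" refers to.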