CAN-QA: A Question-Answering Benchmark for Reasoning over In-Vehicle CAN Traffic

arXiv cs.LG / 4/29/2026


Key Points

  • The paper introduces CAN-QA, a new benchmark that reframes in-vehicle CAN intrusion detection from label classification into question answering with reasoning about traffic behavior.
  • CAN-QA turns raw CAN logs into temporally segmented windows and uses deterministic rule-based templates to create natural-language QA pairs with automatically generated ground-truth answers.
  • The dataset contains 33,128 question-answer pairs across 10 categories, each designed to test different semantic and temporal aspects of CAN traffic.
  • Experiments on large language models, in both True/False and multiple-choice formats, show that the models capture superficial statistical regularities but struggle with temporal reasoning, multi-condition inference, and higher-level behavioral interpretation.
  • The authors release their code at https://github.com/Kriiiiss/CAN-QA.

Abstract

The Controller Area Network (CAN) is a safety-critical in-vehicle communication protocol that lacks built-in security mechanisms, making intrusion detection essential. Existing approaches predominantly formulate CAN intrusion detection as a classification task, mapping complex traffic patterns to attack labels. However, this formulation abstracts away the temporal and relational structure of CAN traffic and misaligns with real-world forensic workflows, which require systematic reasoning about traffic behavior. To address this gap, we introduce CAN-QA, the first benchmark that reformulates CAN traffic analysis as a question-answering (QA) task. CAN-QA converts raw CAN logs into temporally segmented windows and applies deterministic rule-based templates to generate natural-language questions paired with automatically derived ground-truth answers. The resulting dataset comprises 33,128 QA pairs across 10 categories, each targeting distinct semantic and temporal properties of CAN traffic. Using CAN-QA, we evaluate large language models across both True/False and multiple-choice formats. Our results indicate that, although these models capture superficial statistical regularities, they struggle with temporal reasoning, multi-condition inference, and higher-level behavioral interpretation. Our code is available at https://github.com/Kriiiiss/CAN-QA.
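
To make the pipeline described in the abstract concrete, here is a minimal Python sketch of its two steps: segmenting a CAN log into time windows, then applying one deterministic template to produce a True/False question with an automatically derived answer. This is an illustration only, not the authors' implementation. The `Frame` record, the `segment` and `count_question` helpers, the 1-second window length, and the question wording are all assumptions; the paper's actual window size, templates, and 10 categories are not given in this summary.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Frame:
    """One CAN frame (hypothetical record layout, not the benchmark's schema)."""
    timestamp: float  # seconds since the start of the log
    can_id: int       # arbitration ID
    data: bytes       # payload, up to 8 bytes in classic CAN


def segment(frames: list[Frame], window_s: float) -> Iterator[list[Frame]]:
    """Split time-ordered frames into consecutive fixed-length windows.

    Empty windows are skipped. The window length is an assumed parameter.
    """
    if not frames:
        return
    start = frames[0].timestamp
    window: list[Frame] = []
    for frame in frames:
        # Advance the window boundary until this frame falls inside it.
        while frame.timestamp >= start + window_s:
            if window:
                yield window
                window = []
            start += window_s
        window.append(frame)
    if window:
        yield window


def count_question(window: list[Frame], can_id: int, min_count: int) -> tuple[str, bool]:
    """Hypothetical rule-based template: a frequency question about one ID.

    The ground-truth answer is computed directly from the window, so no
    manual labeling is needed.
    """
    count = sum(1 for frame in window if frame.can_id == can_id)
    question = (
        f"Does CAN ID 0x{can_id:03X} appear at least {min_count} times "
        "in this window?"
    )
    return question, count >= min_count


# Usage with made-up frames (not taken from the dataset).
frames = [
    Frame(0.00, 0x316, b"\x00"),
    Frame(0.10, 0x316, b"\x01"),
    Frame(0.20, 0x2B0, b"\x02"),
    Frame(1.05, 0x2B0, b"\x03"),
    Frame(1.50, 0x316, b"\x04"),
    Frame(3.20, 0x316, b"\x05"),
]
windows = list(segment(frames, window_s=1.0))
question, answer = count_question(windows[0], can_id=0x316, min_count=2)
```

Because the answer is a pure function of the window contents, the same log always yields the same QA pairs, which is what makes template-based generation deterministic. Questions that span several windows or combine conditions would need additional templates along the same lines.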