Detection of adversarial intent in Human-AI teams using LLMs
arXiv cs.LG, 2026-03-24
Key Points
- The paper investigates how LLMs can detect adversarial or malicious intent within human-AI teams where agents assist with tasks like retrieval, programming, and decision-making.
- Using a dataset of multi-party human-AI conversations and task outcomes collected from a real team over a 25-round horizon, the study frames malicious-behavior detection as a problem over interaction traces.
- The authors report that LLM-based supervisors can identify malicious behavior in real time and in a task-agnostic way, without requiring task-specific information.
- They find that the malicious behaviors of interest are difficult to catch using simple heuristics, implying that LLM-based defenders may improve robustness against certain attack classes.
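The framing above, detection over interaction traces by an LLM supervisor, can be sketched as follows. This is an illustrative sketch only, not the paper's actual method or interface: `format_trace`, `call_llm`, and `supervise` are hypothetical names, and `call_llm` is a placeholder for whatever LLM client a real deployment would use.

```python
def format_trace(messages):
    """Render a multi-party conversation trace as a supervisor prompt."""
    lines = [f"[turn {i}] {m['speaker']}: {m['text']}"
             for i, m in enumerate(messages, 1)]
    return (
        "You supervise a human-AI team. Based only on the interaction trace "
        "below,\nanswer MALICIOUS or BENIGN, with a one-line reason.\n\n"
        + "\n".join(lines)
    )

def call_llm(prompt):
    # Placeholder: a real deployment would send `prompt` to an LLM API here.
    return "BENIGN: no evidence of sabotage in the trace."

def supervise(messages):
    """Return True if the supervisor flags the trace as malicious."""
    verdict = call_llm(format_trace(messages))
    return verdict.strip().upper().startswith("MALICIOUS")

trace = [
    {"speaker": "agent_A", "text": "Here is the retrieval result."},
    {"speaker": "human", "text": "Thanks, proceeding with the plan."},
]
print(supervise(trace))  # False with the placeholder verdict
```

Note that the supervisor sees only the interaction trace, not the underlying task, which is what makes this kind of monitoring task-agnostic in the sense the summary describes.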
