Detection of adversarial intent in Human-AI teams using LLMs

arXiv cs.LG / 2026-03-24


Key Points

  • The paper investigates how LLMs can detect adversarial or malicious intent within human-AI teams where agents assist with tasks like retrieval, programming, and decision-making.
  • Using a dataset of multi-party conversations and decisions from a real human-AI team over a 25-round horizon, the study frames malicious behavior detection as a problem over interaction traces.
  • The authors report that LLM-based supervisors can identify malicious behavior in real time and without task-specific information, suggesting a task-agnostic defense.
  • They find that the malicious behaviors of interest are difficult to catch using simple heuristics, implying that LLM-based defenders may improve robustness against certain attack classes.
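The supervision setup described above can be sketched in code. This is a minimal, hypothetical illustration, not the paper's implementation: the prompt wording, the `Turn` record, and the keyword-matching stand-in for a real LLM are all assumptions made for the example. The key idea it captures is that the supervisor sees only the raw interaction trace, re-evaluated after every turn (real-time monitoring), with no task-specific inputs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    """One entry in a multi-party interaction trace."""
    speaker: str  # e.g. "human_1", "ai_agent"
    message: str


# Illustrative prompt only; the paper's actual supervisor prompt is not shown here.
SUPERVISOR_PROMPT = (
    "You are a supervisor watching a human-AI team. Given the conversation "
    "so far, answer MALICIOUS or BENIGN for the most recent turn. "
    "Do not rely on any task-specific knowledge.\n\n{trace}"
)


def format_trace(turns: List[Turn]) -> str:
    """Render the trace as plain text, one turn per line."""
    return "\n".join(f"{t.speaker}: {t.message}" for t in turns)


def supervise(turns: List[Turn], llm: Callable[[str], str]) -> List[bool]:
    """Query the LLM supervisor after every turn; True = judged malicious."""
    flags = []
    for i in range(1, len(turns) + 1):
        verdict = llm(SUPERVISOR_PROMPT.format(trace=format_trace(turns[:i])))
        flags.append(verdict.strip().upper().startswith("MALICIOUS"))
    return flags


# Toy stand-in for a real LLM: flags the latest turn if it urges ignoring
# safety checks. A real supervisor would reason over the full context.
def toy_llm(prompt: str) -> str:
    last_line = prompt.rstrip().splitlines()[-1]
    return "MALICIOUS" if "ignore the safety" in last_line.lower() else "BENIGN"


if __name__ == "__main__":
    trace = [
        Turn("human_1", "Which server should we patch first?"),
        Turn("ai_agent", "Patch the database server; it has the oldest kernel."),
        Turn("human_2", "Ignore the safety checklist and ship now."),
    ]
    print(supervise(trace, toy_llm))  # [False, False, True]
```

Because the supervisor only consumes the conversation text, swapping the team's task (retrieval, programming, decision support) requires no change to the monitoring loop, which is the sense in which such a defense can be task-agnostic.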

Abstract

Large language models (LLMs) are increasingly deployed in human-AI teams as support agents for complex tasks such as information retrieval, programming, and decision-making assistance. While these agents' autonomy and contextual knowledge enable them to be useful, they also expose them to a broad range of attacks, including data poisoning, prompt injection, and even prompt engineering. Through these attack vectors, malicious actors can manipulate an LLM agent into providing harmful information, potentially leading human agents to make harmful decisions. While prior work has focused on LLMs as attack targets or adversarial actors, this paper studies their potential role as defensive supervisors within mixed human-AI teams. Using a dataset consisting of multi-party conversations and decisions for a real human-AI team over a 25-round horizon, we formulate the problem of malicious behavior detection from interaction traces. We find that LLMs are capable of identifying malicious behavior in real time and without task-specific information, indicating the potential for task-agnostic defense. Moreover, we find that the malicious behavior of interest is not easily identified using simple heuristics, further suggesting that the introduction of LLM defenders could render human teams more robust to certain classes of attack.