Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines

arXiv cs.AI / April 28, 2026


Key Points

  • The paper introduces a strict black-box adversarial threat model for multi-component NLP pipelines, assuming binary-only feedback, no gradients, and tight query budgets.
  • It proposes a two-agent “semantic perturbation” evasion framework where an Attacker Agent performs meaning-preserving rewrites and a Prompt Optimization Agent refines the attack using only binary decisions within a 10-query limit.
  • Experiments on four misinformation detection pipelines show high evasion rates (19.95%–40.34%) against modern LLM-based systems, far outperforming token-level perturbation baselines under the same constraints.
  • The study finds that architectural properties—evidence retrieval mechanisms, how retrieval couples with inference, and baseline classifier accuracy—strongly determine the attack surface and effectiveness.
  • It identifies four exploitation patterns across different pipeline stages and demonstrates that defenses informed by these patterns can cut evasion success by up to 65.18%.
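The two-agent loop described above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: `attacker_rewrite` and `optimizer_refine` are hypothetical placeholders standing in for the LLM-driven Attacker and Prompt Optimization agents, and the strategy names are invented for the example. Only the structural constraints from the threat model are taken from the paper: binary-only feedback from the target pipeline and a hard 10-query budget.

```python
QUERY_BUDGET = 10  # strict query limit from the paper's threat model

# Hypothetical rewrite strategies (illustrative names, not from the paper).
STRATEGIES = ["paraphrase", "hedge_claims", "reframe_evidence"]

def attacker_rewrite(text, strategy):
    """Stand-in for the Attacker Agent: a meaning-preserving rewrite.
    Here it merely tags the text with the active strategy."""
    return f"[{strategy}] {text}"

def optimizer_refine(strategy, detected):
    """Stand-in for the Prompt Optimization Agent: refine the strategy
    using only the binary verdict (detected / not detected)."""
    if detected:  # still caught -> rotate to the next candidate strategy
        i = STRATEGIES.index(strategy)
        return STRATEGIES[(i + 1) % len(STRATEGIES)]
    return strategy

def evade(pipeline, text, budget=QUERY_BUDGET):
    """Two-agent loop: Attacker rewrites, the target pipeline returns a
    binary decision, the Optimizer refines the strategy. Stops on the
    first successful evasion or when the query budget is exhausted."""
    strategy = STRATEGIES[0]
    for query in range(1, budget + 1):
        candidate = attacker_rewrite(text, strategy)
        detected = pipeline(candidate)  # binary-only feedback
        if not detected:
            return candidate, query  # evasion succeeded
        strategy = optimizer_refine(strategy, detected)
    return None, budget  # budget exhausted, attack failed
```

Against a toy detector such as `lambda t: "hedge_claims" not in t`, the loop rotates strategies until one slips past, never issuing more than ten queries.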

Abstract

Multi-component natural language processing (NLP) pipelines are increasingly deployed for high-stakes decisions, yet no existing adversarial method can test their robustness under realistic conditions: binary-only feedback, no gradient access, and strict query budgets. We formalize this strict black-box threat model and propose a two-agent evasion framework operating in a semantic perturbation space. An Attacker Agent generates meaning-preserving rewrites while a Prompt Optimization Agent refines the attack strategy using only binary decision feedback within a 10-query budget. Evaluated against four evidence-based misinformation detection pipelines, the framework achieves evasion rates of 19.95% to 40.34% on modern large language model (LLM)-based systems, compared to at most 3.90% for token-level perturbation baselines, which must rely on surrogate models because they cannot operate under our threat model. A legacy system relying on static lexical retrieval exhibits near-total vulnerability (97.02% evasion rate), establishing a lower bound that exposes how architectural choices govern the attack surface. Evasion effectiveness is associated with three architectural properties: the evidence retrieval mechanism, retrieval–inference coupling, and baseline classification accuracy. Iterative prompt optimization yields the largest marginal gains against the most robust targets, confirming that adaptive strategy discovery is essential when evasion is non-trivial. Analysis of successful rewrites reveals four exploitation patterns, each targeting failures at a distinct pipeline stage. A pattern-informed defense reduces the evasion rate by up to 65.18%.
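The pattern-informed defense can be pictured as a wrapper around the original pipeline. The sketch below is an assumption-laden illustration, not the paper's defense: it assumes a hypothetical `normalize` function that undoes the exploitation patterns (e.g., stripping adversarial hedging before classification) and flags an input if either the raw text or its normalized form is detected.

```python
def pattern_informed_defense(pipeline, normalize):
    """Hypothetical defense wrapper (illustrative, not the paper's method):
    classify both the raw input and a pattern-normalized rewrite, and flag
    the input if either copy triggers the detector."""
    def defended(text):
        return pipeline(text) or pipeline(normalize(text))
    return defended
```

For example, if a detector only matches lowercase `"false"` and attackers evade it by changing case, wrapping it with a lowercasing normalizer restores detection while leaving benign inputs untouched.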