Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
arXiv cs.AI / 4/28/2026
Key Points
- The paper introduces a strict black-box adversarial threat model for multi-component NLP pipelines, assuming binary-only feedback, no gradients, and tight query budgets.
- It proposes a two-agent “semantic perturbation” evasion framework where an Attacker Agent performs meaning-preserving rewrites and a Prompt Optimization Agent refines the attack using only binary decisions within a 10-query limit.
- Experiments on four misinformation detection pipelines show high evasion rates (19.95%–40.34%) against modern LLM-based systems, far outperforming token-level perturbation baselines under the same constraints.
- The study finds that architectural properties—evidence retrieval mechanisms, how retrieval couples with inference, and baseline classifier accuracy—strongly determine the attack surface and effectiveness.
- It identifies four exploitation patterns across different pipeline stages and demonstrates that defenses informed by these patterns can cut evasion success by up to 65.18%.
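The attack loop the key points describe can be sketched in miniature. Everything below is an illustrative assumption, not the paper's implementation: `classifier` stands in for the black-box detector (binary verdict only), `attacker_rewrite` stands in for the Attacker Agent's meaning-preserving rewrite, and the strategy list stands in for the Prompt Optimization Agent's refinements, all under the 10-query cap.

```python
QUERY_BUDGET = 10  # hard query cap from the paper's threat model


def classifier(text):
    # Stand-in for the black-box misinformation detector.
    # Hypothetical rule: flag any text containing the word "hoax".
    return "hoax" in text.lower()


def attacker_rewrite(text, replacement):
    # Stand-in for the Attacker Agent: a meaning-preserving rewrite.
    # A real system would use an LLM; here we swap one loaded term.
    return text.replace("hoax", replacement)


def optimize_attack(text):
    # Stand-in for the Prompt Optimization Agent's loop: try rewrite
    # strategies in turn, using ONLY the detector's binary verdict,
    # and stop at the query budget.
    strategies = ["fabrication", "unverified claim", "disputed story"]
    queries = 0
    for replacement in strategies:
        if queries >= QUERY_BUDGET:
            break
        candidate = attacker_rewrite(text, replacement)
        queries += 1                       # each verdict costs one query
        if not classifier(candidate):      # binary-only feedback
            return candidate, queries      # evasion succeeded
    return None, queries                   # budget exhausted, no evasion


result, used = optimize_attack("The vaccine story is a hoax.")
```

In this toy setup the first strategy already evades the detector, so `result` is the rewritten text and `used` is 1; the point is only the control flow (binary feedback, bounded queries), not the rewrite quality.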