AI Navigate

Semantic Chameleon: Corpus-Dependent Poisoning Attacks and Defenses in RAG Systems

arXiv cs.AI / 3/20/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper examines gradient-guided corpus poisoning attacks in Retrieval-Augmented Generation (RAG) systems, showing attackers can manipulate the retrieval corpus to bias model outputs.
  • It introduces dual-document poisoning (a sleeper document and a trigger document) optimized with Greedy Coordinate Gradient, achieving a 38.0 percent co-retrieval rate under pure vector retrieval on a 67,941-document Security Stack Exchange corpus across 50 attack attempts.
  • A simple defense—hybrid retrieval combining BM25 and vector similarity—lowers gradient-guided attack success from 38% to 0% without modifying the LLM or retraining the retriever; attackers can partially circumvent it (20–44% success) by jointly optimizing payloads for both sparse and dense retrieval signals.
  • Cross-model evaluation across GPT-5.3, GPT-4o, Claude Sonnet 4.6, Llama 4, and GPT-4o-mini shows attack success ranging from 46.7% to 93.3%, while cross-corpus FEVER experiments yield 0% success across all retrieval configurations, indicating that attack effectiveness is strongly model- and corpus-dependent.
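The hybrid-retrieval defense in the third point can be sketched as a weighted fusion of sparse (BM25) and dense (vector) relevance scores. This is a minimal illustration, not the paper's implementation: the function names, the min-max normalization, and the equal-weight `alpha` are assumptions, and the scores are stand-ins for real BM25 and embedding-similarity outputs.

```python
def minmax(scores):
    """Normalize a {doc_id: score} dict to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_scores(bm25, dense, alpha=0.5):
    """Fuse sparse and dense scores per document via a weighted sum.

    A poisoned document optimized only against the dense embedding model
    tends to score poorly under BM25's lexical matching, so its fused
    score drops relative to genuinely relevant documents.
    """
    b, v = minmax(bm25), minmax(dense)
    docs = set(b) | set(v)
    return {d: alpha * b.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0) for d in docs}

# Illustrative scores: "poison" dominates the dense signal only.
bm25 = {"d1": 5.0, "d2": 3.0, "poison": 0.1}
dense = {"d1": 0.6, "d2": 0.5, "poison": 0.95}
fused = hybrid_scores(bm25, dense)
```

Under pure vector retrieval the poisoned document ranks first, but after fusion the lexically relevant `d1` overtakes it, which is the intuition behind the 38% → 0% drop reported above.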

Abstract

Retrieval-Augmented Generation (RAG) systems extend large language models (LLMs) with external knowledge sources but introduce new attack surfaces through the retrieval pipeline. In particular, adversaries can poison retrieval corpora so that malicious documents are preferentially retrieved at inference time, enabling targeted manipulation of model outputs. We study gradient-guided corpus poisoning attacks against modern RAG pipelines and evaluate retrieval-layer defenses that require no modification to the underlying LLM. We implement dual-document poisoning attacks consisting of a sleeper document and a trigger document optimized using Greedy Coordinate Gradient (GCG). In a large-scale evaluation on the Security Stack Exchange corpus (67,941 documents) with 50 attack attempts, gradient-guided poisoning achieves a 38.0 percent co-retrieval rate under pure vector retrieval. We show that a simple architectural modification, hybrid retrieval combining BM25 and vector similarity, substantially mitigates this attack. Across all 50 attacks, hybrid retrieval reduces gradient-guided attack success from 38 percent to 0 percent without modifying the model or retraining the retriever. When attackers jointly optimize payloads for both sparse and dense retrieval signals, hybrid retrieval can be partially circumvented, achieving 20-44 percent success, but still significantly raises attack difficulty relative to vector-only retrieval. Evaluation across five LLM families (GPT-5.3, GPT-4o, Claude Sonnet 4.6, Llama 4, and GPT-4o-mini) shows attack success ranging from 46.7 percent to 93.3 percent. Cross-corpus evaluation on the FEVER Wikipedia dataset (25 attacks) yields 0 percent attack success across all retrieval configurations.
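The co-retrieval rate reported in the abstract measures how often both halves of the dual-document attack (the sleeper and the trigger document) land in the retrieved set together. A minimal sketch of the metric, assuming each attack yields a ranked list of document labels (the labels and list format here are illustrative, not the paper's data structures):

```python
def co_retrieval_rate(attacks, top_k=5):
    """Fraction of attacks where BOTH the sleeper and trigger documents
    appear in the top-k retrieved results.

    attacks: list of ranked result lists, one per attack attempt,
             containing document labels such as "sleeper" and "trigger".
    """
    hits = sum(
        1 for ranked in attacks
        if {"sleeper", "trigger"} <= set(ranked[:top_k])
    )
    return hits / len(attacks)

# Toy example: 4 attack attempts, top-3 retrieval.
attacks = [
    ["trigger", "sleeper", "a"],   # both co-retrieved
    ["b", "c", "sleeper"],         # trigger missing
    ["sleeper", "x", "trigger"],   # both co-retrieved
    ["trigger", "y", "z"],         # sleeper missing
]
rate = co_retrieval_rate(attacks, top_k=3)  # 2 of 4 attempts succeed
```

The paper's headline 38.0 percent figure is this rate computed over its 50 gradient-guided attack attempts against pure vector retrieval on the 67,941-document corpus.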