Semantic Chameleon: Corpus-Dependent Poisoning Attacks and Defenses in RAG Systems
arXiv cs.AI / 3/20/2026
Key Points
- The paper examines gradient-guided corpus poisoning attacks in Retrieval-Augmented Generation (RAG) systems, showing attackers can manipulate the retrieval corpus to bias model outputs.
- It introduces dual-document poisoning (a sleeper document and a trigger document) optimized with Greedy Coordinate Gradient (GCG), achieving a 38.0% co-retrieval rate under pure vector retrieval on a 67,941-document Security Stack Exchange corpus across 50 attack attempts.
- A simple defense—hybrid retrieval combining BM25 and vector similarity—greatly reduces attack success, lowering it from 38% to 0% without modifying the LLM or retraining the retriever; however, attackers could still partially circumvent it with payloads crafted to score well on both sparse and dense signals.
- Cross-model evaluation across GPT-5.3, GPT-4o, Claude Sonnet 4.6, Llama 4, and GPT-4o-mini shows attack success ranging from 46.7% to 93.3%, while cross-corpus FEVER experiments yield 0% success across configurations, indicating that both attack effectiveness and defense robustness are dataset- and model-dependent.
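The hybrid-retrieval defense can be illustrated with a minimal sketch. The paper combines BM25 and vector-similarity signals; reciprocal rank fusion (RRF) is one common way to merge the two rankings, used here as an assumption rather than the paper's exact fusion method. All document IDs and rankings below are hypothetical.

```python
def rrf_fuse(sparse_ranking, dense_ranking, k=60):
    """Merge two ranked lists of doc IDs via reciprocal rank fusion.

    A document optimized to score well only on dense (embedding)
    similarity drops in the fused ranking unless it also ranks
    well on the sparse (BM25) side -- the intuition behind why
    hybrid retrieval blunts embedding-targeted poisoning.
    """
    scores = {}
    for ranking in (sparse_ranking, dense_ranking):
        for rank, doc_id in enumerate(ranking):
            # Standard RRF contribution: 1 / (k + rank), 1-indexed.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: "poison" tops the dense ranking (as a
# GCG-optimized document might) but is absent from the BM25 ranking,
# so legitimate documents outrank it after fusion.
sparse = ["doc_a", "doc_b", "doc_c"]   # BM25 ranking
dense = ["poison", "doc_a", "doc_b"]   # vector-similarity ranking
fused = rrf_fuse(sparse, dense)
# → ["doc_a", "doc_b", "poison", "doc_c"]
```

As the bullet above notes, this defense is not absolute: a payload crafted to rank highly on both signals would still surface in the fused list.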