A Case Study on the Impact of Anonymization Along the RAG Pipeline

arXiv cs.CL / 4/20/2026


Key Points

  • The case study examines how anonymization affects Retrieval-Augmented Generation (RAG) systems, focusing on privacy risks from PII leakage to the LLM or end users.
  • It addresses a gap in prior work by testing where anonymization should be applied within the RAG pipeline rather than treating it as a one-size-fits-all preprocessing step.
  • The researchers empirically measure the impact of anonymization at two key stages: the underlying dataset stage and the generated-answer stage.
  • The results show that the privacy–utility trade-off varies depending on the placement, highlighting the importance of choosing the right anonymization point to mitigate risk without unnecessarily degrading answer quality.
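To make the two placements concrete, here is a minimal, hypothetical sketch of a toy RAG pipeline. The `anonymize`, `retrieve`, and `generate` functions are illustrative stand-ins (regex scrubbing, word-overlap retrieval, and an echo "LLM"), not the paper's actual methods; a real deployment would use an NER-based anonymizer and a genuine retriever and model.

```python
import re

def anonymize(text: str) -> str:
    """Hypothetical PII scrubber: masks emails and capitalized name pairs."""
    text = re.sub(r"[\w.]+@[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", "[NAME]", text)
    return text

def retrieve(query: str, docs: list[str]) -> str:
    """Toy retriever: return the doc sharing the most words with the query."""
    qs = set(query.lower().split())
    return max(docs, key=lambda d: len(qs & set(d.lower().split())))

def generate(query: str, context: str) -> str:
    """Stand-in for an LLM call: simply echoes the retrieved context."""
    return f"Based on the records: {context}"

docs = ["Alice Jones reported chest pain, contact alice@example.com",
        "Routine checkup notes for blood pressure monitoring"]
query = "Who reported chest pain?"

# Placement A: anonymize the dataset before retrieval,
# so the LLM never sees the raw PII.
answer_a = generate(query, retrieve(query, [anonymize(d) for d in docs]))

# Placement B: anonymize only the generated answer,
# so the end user is protected but the LLM saw the raw PII.
answer_b = anonymize(generate(query, retrieve(query, docs)))
```

Both answers hide PII from the end user, but only placement A also shields the LLM, while placement B preserves the raw data for retrieval. That asymmetry is exactly the privacy-utility trade-off the study measures.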

Abstract

Despite the considerable promise of Retrieval-Augmented Generation (RAG), many real-world use cases raise privacy concerns: the purported utility of RAG-enabled insights comes at the risk of exposing private information to either the LLM or the end user requesting the response. As a potential mitigation, applying anonymization techniques to remove personally identifiable information (PII) and other sensitive markers from the underlying data is a practical and sensible course of action for RAG administrators. Despite a wealth of literature on the topic, no prior work considers the placement of anonymization along the RAG pipeline, i.e., asks the question: where should anonymization happen? In this case study, we systematically and empirically measure the impact of anonymization at two important points along the RAG pipeline: the dataset and the generated answer. We show that privacy–utility trade-offs differ depending on where anonymization takes place, demonstrating the significance of where privacy risk mitigation is placed in RAG.