Semantic Entanglement in Vector-Based Retrieval: A Formal Framework and Context-Conditioned Disentanglement Pipeline for Agentic RAG Systems

arXiv cs.AI / 4/21/2026

Key Points

  • The paper identifies “semantic entanglement” in vector-based retrieval: when documents interleave topics in contiguous text, standard embeddings can place semantically distinct content in overlapping neighborhoods.
  • It formalizes entanglement as a model-relative measure of cross-topic overlap in embedding space, defines an Entanglement Index (EI) as a quantitative proxy, and argues that higher EI constrains attainable Top-K retrieval precision under cosine-similarity retrieval.
  • To mitigate this failure mode, the authors propose a Semantic Disentanglement Pipeline (SDP), a four-stage preprocessing framework that restructures documents before embedding to reduce cross-topic overlap.
  • The work further introduces context-conditioned preprocessing, in which document structure is shaped by patterns of operational use, and a continuous feedback mechanism that adapts document structure based on agent performance.
  • In a healthcare enterprise knowledge base (2,000+ documents, ~25 sub-domains), Top-K retrieval precision rises from ~32% with fixed-token chunking to ~82% with SDP, while mean EI drops from 0.71 to 0.14.
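
The summary does not give the paper's formula for EI, so the following is only an illustrative stand-in, not the authors' definition. It is a minimal Python sketch that scores entanglement as the average fraction of each chunk's k nearest cosine neighbors that carry a different topic label. The function names (`cosine`, `entanglement_proxy`), the topic labels, and the toy 2-D vectors are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def entanglement_proxy(vectors, topics, k=2):
    """Hypothetical EI stand-in (NOT the paper's definition):
    mean fraction of each item's k nearest neighbors with a different topic."""
    scores = []
    for i, v in enumerate(vectors):
        # Rank every other item by cosine similarity to item i, most similar first.
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: cosine(v, vectors[j]), reverse=True)
        nearest = others[:k]
        cross = sum(1 for j in nearest if topics[j] != topics[i])
        scores.append(cross / len(nearest))
    return sum(scores) / len(scores)

topics = ["billing", "billing", "billing", "clinical", "clinical", "clinical"]

# Toy "embeddings" of single-topic chunks: each topic stays near its own axis.
separated = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
             [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]]

# Toy "embeddings" of chunks that blend both topics: vectors crowd the diagonal.
mixed = [[0.6, 0.4], [0.5, 0.5], [0.55, 0.45],
         [0.4, 0.6], [0.52, 0.48], [0.44, 0.56]]

ei_separated = entanglement_proxy(separated, topics, k=2)
ei_mixed = entanglement_proxy(mixed, topics, k=2)
print(f"separated: {ei_separated:.2f}  mixed: {ei_mixed:.2f}")
```

On this toy data the single-topic layout scores lower than the blended one, which is the direction of change the reported drop in mean EI describes. The numbers themselves carry no meaning beyond the illustration.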

Abstract

Retrieval-Augmented Generation (RAG) systems depend on the geometric properties of vector representations to retrieve contextually appropriate evidence. When source documents interleave multiple topics within contiguous text, standard vectorization produces embedding spaces in which semantically distinct content occupies overlapping neighborhoods. We term this condition semantic entanglement. We formalize entanglement as a model-relative measure of cross-topic overlap in embedding space and define an Entanglement Index (EI) as a quantitative proxy. We argue that higher EI constrains attainable Top-K retrieval precision under cosine similarity retrieval. To address this, we introduce the Semantic Disentanglement Pipeline (SDP), a four-stage preprocessing framework that restructures documents prior to embedding. We further propose context-conditioned preprocessing, in which document structure is shaped by patterns of operational use, and a continuous feedback mechanism that adapts document structure based on agent performance. We evaluate SDP on a real-world enterprise healthcare knowledge base comprising over 2,000 documents across approximately 25 sub-domains. Top-K retrieval precision improves from approximately 32% under fixed-token chunking to approximately 82% under SDP, while mean EI decreases from 0.71 to 0.14. We do not claim that entanglement fully explains RAG failure, but that it captures a distinct preprocessing failure mode that downstream optimization cannot reliably correct once encoded into the vector space.
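
To make the evaluation metric concrete, here is a minimal sketch of Top-K precision under cosine-similarity retrieval: rank chunks by cosine similarity to a query vector and report the share of the top K that are relevant. The abstract does not spell out the paper's exact evaluation protocol, so the relevance labels, the function names (`cosine`, `top_k_precision`), the chunk ids, and the toy vectors below are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_precision(query, chunks, relevant_ids, k):
    """Fraction of the k highest-scoring chunks (by cosine similarity) that are relevant."""
    ranked = sorted(chunks, key=lambda cid: cosine(query, chunks[cid]), reverse=True)
    top = ranked[:k]
    return sum(1 for cid in top if cid in relevant_ids) / len(top)

# Hypothetical chunk ids mapped to toy 2-D embeddings.
chunks = {
    "dosage_a": [0.9, 0.1],
    "dosage_b": [0.8, 0.2],
    "billing_a": [0.1, 0.9],
    "mixed_c": [0.7, 0.3],  # chunk that interleaves dosage and billing text
}
query = [1.0, 0.0]                    # toy embedding of a dosage question
relevant = {"dosage_a", "dosage_b"}   # assumed ground-truth labels

p_at_2 = top_k_precision(query, chunks, relevant, k=2)
p_at_3 = top_k_precision(query, chunks, relevant, k=3)
print(p_at_2, p_at_3)
```

In this toy setup the mixed-topic chunk sits close enough to the query to enter the top 3, pulling precision below what the two clean chunks achieve at K=2. That is the kind of effect the abstract attributes to entanglement.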