SPIRE: Structure-Preserving Interpretable Retrieval of Evidence

arXiv cs.CL / 4/24/2026


Key Points

  • The paper argues that retrieval-augmented generation (RAG) over semi-structured documents like HTML suffers because document structure gets flattened into sequence-based chunks for embeddings and generation.
  • It proposes SPIRE, a structure-preserving retrieval pipeline that operates on tree-structured documents and represents retrieval candidates as addressable subdocuments (subselections) defined by structural primitives.
  • SPIRE introduces global and local contextualization methods: global adds non-local scaffolding (e.g., titles, headers, list/table structure) and local expands a seed within its structural neighborhood to produce compact, context-rich evidence.
  • The method includes an embedding-based candidate generator over sentence-seeded subdocuments and a query-time aggregation step that reuses shared structural context, followed by contextual filtering that re-scores candidates.
  • Experiments on HTML question-answering benchmarks show SPIRE produces higher-quality and more diverse citations than strong passage-based baselines under fixed retrieval budgets while remaining scalable.
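
The candidate-generation and aggregation steps described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the bag-of-words `embed` function stands in for a real embedding model, and the names `embed`, `cosine`, `retrieve`, and the `(doc_id, path, sentence)` index layout are all assumptions made for the example. The contextual filtering (re-scoring) stage is omitted.

```python
import math
from collections import Counter, defaultdict

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, index, k=3):
    """Score every sentence seed, keep the top k, and group hits by document.

    `index` is a list of (doc_id, path, sentence) entries, where `path`
    is a hypothetical structural address of the sentence in its document.
    """
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(sent)), doc_id, path, sent)
         for doc_id, path, sent in index),
        key=lambda hit: hit[0],
        reverse=True,
    )[:k]
    grouped = defaultdict(list)
    for score, doc_id, path, sent in scored:
        grouped[doc_id].append((path, round(score, 3), sent))
    return dict(grouped)

# Hypothetical example data: two documents, three sentence seeds.
index = [
    ("doc1", (0, 1, 0), "The warranty covers battery defects for two years"),
    ("doc1", (0, 2, 0), "Shipping takes five business days"),
    ("doc2", (1, 0, 0), "Battery replacement costs are listed in the table"),
]
hits = retrieve("battery warranty", index, k=2)
```

Grouping hits by document is what would let a later step compute shared context, such as a page title or section headers, once per document rather than once per hit.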

Abstract

Retrieval-augmented generation over semi-structured sources such as HTML is constrained by a mismatch between document structure and the flat, sequence-based interfaces of today's embedding and generative models. Retrieval pipelines often linearize documents into fixed-size chunks before indexing, which obscures section structure, lists, and tables, and makes it difficult to return small, citation-ready evidence without losing the surrounding context that makes it interpretable. We present a structure-aware retrieval pipeline that operates over tree-structured documents. The core idea is to represent candidates as subdocuments: precise, addressable selections that preserve structural identity while deferring the choice of surrounding context. We define a small set of document primitives: paths and path sets, subdocument extraction by pruning, and two contextualization mechanisms. Global contextualization adds the non-local scaffolding needed to make a selection intelligible (e.g., titles, headers, list and table structure). Local contextualization expands a seed selection within its structural neighborhood to obtain a compact, context-rich view under a target budget. Building on these primitives, we describe an embedding-based candidate generator that indexes sentence-seeded subdocuments and a query-time, document-aware aggregation step that amortizes shared structural context. We then introduce a contextual filtering stage that re-scores retrieved candidates using locally contextualized views. Across experiments on HTML question-answering benchmarks, we find that preserving structure while contextualizing selections yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines, while maintaining scalability.
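
To make the primitives named in the abstract more concrete, here is a small sketch of how paths, path sets, pruning-based extraction, and the two contextualization steps might look on a toy document tree. Everything here is an assumption for illustration: the `Node` class, the function names `get`, `extract`, `globalize`, `localize`, and `render`, the character-count budget, and the premise that each section node wraps its own heading. The paper's actual definitions may differ.

```python
from dataclasses import dataclass, field

HEADINGS = {"title", "h1", "h2", "h3"}  # assumed set of scaffolding tags

@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)

def get(root, path):
    """Resolve a path (tuple of child indices from the root) to a node."""
    node = root
    for i in path:
        node = node.children[i]
    return node

def render(node):
    """Flatten a (sub)tree to text, one non-empty node per line."""
    parts = [node.text] if node.text else []
    parts += [render(child) for child in node.children]
    return "\n".join(p for p in parts if p)

def extract(root, paths):
    """Extraction by pruning: keep selected nodes with their subtrees,
    plus the ancestors needed to reach them; drop everything else."""
    def walk(node, prefix):
        if prefix in paths:
            return node  # selected node: keep its whole subtree
        kept = []
        for i, child in enumerate(node.children):
            child_path = prefix + (i,)
            if any(p[:len(child_path)] == child_path for p in paths):
                kept.append(walk(child, child_path))
        return Node(node.tag, node.text, kept)
    return walk(root, ())

def globalize(root, paths):
    """Global contextualization (simplified): add the heading nodes that
    precede each selection at every ancestor level."""
    out = set(paths)
    for path in paths:
        for depth in range(len(path)):
            parent = get(root, path[:depth])
            for i in range(path[depth]):
                if parent.children[i].tag in HEADINGS:
                    out.add(path[:depth] + (i,))
    return out

def localize(root, seed, budget):
    """Local contextualization (simplified): grow a non-root seed over
    adjacent siblings while the rendered text fits a character budget."""
    parent = get(root, seed[:-1])
    chosen = {seed}
    lo = hi = seed[-1]
    used = len(render(get(root, seed)))
    while True:
        grew = False
        for i in (lo - 1, hi + 1):
            if 0 <= i < len(parent.children):
                cost = len(render(parent.children[i]))
                if used + cost <= budget:
                    chosen.add(seed[:-1] + (i,))
                    used += cost
                    lo, hi = min(lo, i), max(hi, i)
                    grew = True
        if not grew:
            return chosen

# Hypothetical example document with two sections.
root = Node("html", children=[
    Node("title", "Battery Guide"),
    Node("section", children=[
        Node("h2", "Warranty"),
        Node("p", "Coverage lasts two years."),
        Node("p", "Claims require a receipt."),
    ]),
    Node("section", children=[
        Node("h2", "Shipping"),
        Node("p", "Delivery takes five days."),
    ]),
])
seed = (1, 2)  # path to "Claims require a receipt."
global_view = render(extract(root, globalize(root, {seed})))
local_paths = localize(root, seed, budget=55)
```

In this sketch the seed sentence keeps its address `(1, 2)` throughout, and the choice of context is deferred: `globalize` and `localize` only return larger path sets, which `extract` then turns into a view. Here `global_view` contains the page title, the "Warranty" header, and the seed sentence, while `local_paths` adds only the neighboring paragraph because the header would exceed the 55-character budget.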