TopoChunker: Topology-Aware Agentic Document Chunking Framework

arXiv cs.CL / 3/20/2026

📰 News · Models & Research

Key Points

  • TopoChunker introduces a topology-aware framework for document chunking in retrieval-augmented generation by mapping content into a Structured Intermediate Representation to preserve cross-segment dependencies.
  • It uses a dual-agent system: an Inspector Agent routes documents along cost-optimized extraction paths, and a Refiner Agent audits capacity and disambiguates topological context to reconstruct hierarchical lineage.
  • The approach achieves state-of-the-art results on GutenQA and GovReport, outperforming the strongest LLM baseline by 8.0 percentage points in absolute generation accuracy and achieving 83.26% Recall@3.
  • It also reduces token overhead by 23.5%, offering a scalable solution for structure-aware RAG and potentially shaping future RAG pipelines.

Abstract

Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.
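
To make the dual-agent idea concrete, here is a minimal sketch of what topology-aware chunking could look like. All names (`Node`, `inspector_route`, `refiner_chunks`) and the routing heuristic are illustrative assumptions, not the paper's actual implementation: the SIR is modeled as a simple section tree, the "Inspector" picks an extraction path from a cheap structural signal, and the "Refiner" emits chunks annotated with their hierarchical lineage instead of flat text.

```python
# Hypothetical sketch of a topology-aware chunker in the spirit of
# TopoChunker. Names and logic are illustrative, not the paper's code.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a Structured Intermediate Representation (SIR) tree."""
    title: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def inspector_route(doc: Node) -> str:
    """Inspector Agent stand-in: route flat documents down a cheap
    linear path, deeply nested ones down a structure-preserving path."""
    def depth(n: Node) -> int:
        return 1 + max((depth(c) for c in n.children), default=0)
    return "structural" if depth(doc) > 2 else "linear"


def refiner_chunks(doc: Node) -> List[dict]:
    """Refiner Agent stand-in: emit chunks tagged with their ancestor
    titles, so cross-segment (parent/child) context survives chunking."""
    chunks: List[dict] = []

    def walk(node: Node, lineage: List[str]) -> None:
        path = lineage + [node.title]
        if node.text:
            chunks.append({"lineage": " > ".join(path), "text": node.text})
        for child in node.children:
            walk(child, path)

    walk(doc, [])
    return chunks


doc = Node("Report", children=[
    Node("Intro", "Background material."),
    Node("Findings", children=[Node("Costs", "Spending rose 4%.")]),
])
print(inspector_route(doc))  # path chosen by the depth heuristic
for c in refiner_chunks(doc):
    print(c["lineage"], "|", c["text"])
```

The point of the lineage annotation is that a retrieved chunk like "Report > Findings > Costs" carries its hierarchical context, which a naive linear splitter would discard.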