TopoChunker: Topology-Aware Agentic Document Chunking Framework

arXiv cs.CL / 3/20/2026

Key Points

TopoChunker introduces a topology-aware framework for document chunking in retrieval-augmented generation by mapping content into a Structured Intermediate Representation to preserve cross-segment dependencies.
It uses a dual-agent system: an Inspector Agent routes documents along cost-optimized extraction paths, and a Refiner Agent audits capacity and disambiguates topological context to reconstruct hierarchical lineage.
The approach achieves state-of-the-art results on GutenQA and GovReport, outperforming the strongest LLM-based baseline by 8.0 percentage points in absolute generation accuracy and reaching a Recall@3 of 83.26%.
It also reduces token overhead by 23.5%, offering a scalable solution for structure-aware RAG and potentially shaping future RAG pipelines.
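
To make the key points concrete, here is a minimal sketch of the general idea, not the paper's implementation. It assumes a toy tree as the intermediate representation, a heuristic "inspector" that picks a cheap rule-based path when `#`-style headings are present, and a "refiner" that splits over-capacity text by word count and attaches each chunk's heading lineage. All names (`Node`, `inspect_route`, `parse_rule_based`, `refine`, the `llm_fallback` label) are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a toy structured intermediate representation (a tree)."""
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


HEADING = re.compile(r"^(#+)\s+(.*)$")


def inspect_route(doc: str) -> str:
    """Toy inspector: choose the cheap path when explicit headings exist."""
    has_headings = any(HEADING.match(line) for line in doc.splitlines())
    return "rule_based" if has_headings else "llm_fallback"  # labels are assumptions


def parse_rule_based(doc: str) -> Node:
    """Build a tree from '#'-style headings; body lines attach to the open node."""
    root = Node("ROOT")
    stack = [(0, root)]  # (heading depth, node); the root is never popped
    for line in doc.splitlines():
        match = HEADING.match(line)
        if match:
            depth = len(match.group(1))
            node = Node(match.group(2).strip())
            while stack[-1][0] >= depth:
                stack.pop()
            stack[-1][1].children.append(node)
            stack.append((depth, node))
        elif line.strip():
            current = stack[-1][1]
            current.text = (current.text + " " + line.strip()).strip()
    return root


def refine(node: Node, max_words: int, lineage: tuple = ()) -> list[dict]:
    """Toy refiner: split over-capacity text and attach the heading lineage."""
    path = lineage if node.title == "ROOT" else lineage + (node.title,)
    words = node.text.split()
    chunks = [
        {"lineage": " > ".join(path), "text": " ".join(words[i:i + max_words])}
        for i in range(0, len(words), max_words)
    ]
    for child in node.children:
        chunks.extend(refine(child, max_words, path))
    return chunks


doc = "# Report\nIntro text here.\n## Findings\none two three four five\n"
route = inspect_route(doc)
chunks = refine(parse_rule_based(doc), max_words=3)
```

In this sketch, every chunk under "Findings" carries the lineage `Report > Findings`, so a retriever sees where the text sits in the hierarchy even after it has been split. A word-count limit stands in for whatever capacity measure the actual system uses.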

Abstract

Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.
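
For readers unfamiliar with the retrieval metric, the sketch below shows one common way to compute Recall@k. It assumes the hit-rate definition: the fraction of queries for which at least one gold chunk appears among the top k retrieved chunks. The abstract does not spell out the exact definition used, so treat this as illustrative; the function name and the sample IDs are hypothetical.

```python
def recall_at_k(ranked_ids, gold_ids, k=3):
    """Fraction of queries with at least one gold chunk in the top-k results.

    ranked_ids: one ranked list of retrieved chunk IDs per query.
    gold_ids:   one set of relevant chunk IDs per query (same order).
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    hits = sum(
        1
        for ranked, gold in zip(ranked_ids, gold_ids)
        if set(ranked[:k]) & set(gold)
    )
    return hits / len(ranked_ids)


# Made-up example: three queries, only the first has a gold chunk in its top 3.
ranked = [["a", "b", "c", "d"], ["x", "y", "z"], ["p", "q", "r", "s"]]
gold = [{"c"}, {"w"}, {"s"}]
score = recall_at_k(ranked, gold, k=3)
```

Under this definition, a reported Recall@3 of 83.26% would mean that for roughly five out of six queries a relevant chunk was ranked in the top three.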
