TopoChunker: Topology-Aware Agentic Document Chunking Framework

arXiv cs.CL / 3/20/2026

📰 News · Models & Research

Key Points

  • TopoChunker introduces a topology-aware framework for document chunking in retrieval-augmented generation by mapping content into a Structured Intermediate Representation to preserve cross-segment dependencies.
  • It uses a dual-agent system: an Inspector Agent routes documents along cost-optimized extraction paths, and a Refiner Agent audits capacity and disambiguates topological context to reconstruct hierarchical lineage.
  • The approach achieves state-of-the-art results on GutenQA and GovReport, outperforming the strongest LLM baseline by 8.0 percentage points in absolute generation accuracy and achieving 83.26% Recall@3.
  • It also reduces token overhead by 23.5%, offering a scalable solution for structure-aware RAG and potentially shaping future RAG pipelines.

Abstract

Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.
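
To make the dual-agent idea concrete, here is a minimal sketch of what topology-aware chunking could look like. All names (`Node`, `inspector_route`, `refiner_chunks`) and the routing heuristic are illustrative assumptions, not the paper's actual implementation: the SIR is modeled as a simple section tree, the "Inspector" picks an extraction path from a cheap structural signal, and the "Refiner" emits chunks annotated with their hierarchical lineage instead of flat text.

```python
# Hypothetical sketch of a topology-aware chunker in the spirit of
# TopoChunker. Names and logic are illustrative, not the paper's code.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a Structured Intermediate Representation (SIR) tree."""
    title: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def inspector_route(doc: Node) -> str:
    """Inspector Agent stand-in: route flat documents down a cheap
    linear path, deeply nested ones down a structure-preserving path."""
    def depth(n: Node) -> int:
        return 1 + max((depth(c) for c in n.children), default=0)
    return "structural" if depth(doc) > 2 else "linear"


def refiner_chunks(doc: Node) -> List[dict]:
    """Refiner Agent stand-in: emit chunks tagged with their ancestor
    titles, so cross-segment (parent/child) context survives chunking."""
    chunks: List[dict] = []

    def walk(node: Node, lineage: List[str]) -> None:
        path = lineage + [node.title]
        if node.text:
            chunks.append({"lineage": " > ".join(path), "text": node.text})
        for child in node.children:
            walk(child, path)

    walk(doc, [])
    return chunks


doc = Node("Report", children=[
    Node("Intro", "Background material."),
    Node("Findings", children=[Node("Costs", "Spending rose 4%.")]),
])
print(inspector_route(doc))  # path chosen by the depth heuristic
for c in refiner_chunks(doc):
    print(c["lineage"], "|", c["text"])
```

The point of the lineage annotation is that a retrieved chunk like "Report > Findings > Costs" carries its hierarchical context, which a naive linear splitter would discard.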