Structure-Aware Chunking for Tabular Data in Retrieval-Augmented Generation

arXiv cs.CL / 5/4/2026

Key Points

The paper argues that common RAG chunking methods, which were designed for unstructured text, fail to exploit the inherent structure of tabular data such as CSV and Excel files.
It introduces a structure-aware tabular chunking (STC) framework that builds a hierarchical Row Tree and encodes each row as a key-value block for structure-aligned splitting and merging.
STC uses token-constrained splitting at structural boundaries and overlap-free greedy merging to create dense, non-overlapping chunks that better preserve intra-row field relationships.
Experiments on the MAUD dataset show that STC reduces chunk count by up to 40% versus a standard recursive baseline and by up to 56% versus a key-value-based baseline, alongside better token utilization and processing efficiency.
Retrieval experiments report substantial gains: mean reciprocal rank (MRR) improves from 0.3576 to 0.5945 in a hybrid retrieval setting, and Recall@1 rises from 0.366 to 0.754 with BM25-only retrieval.
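
The splitting and merging steps described in the key points can be illustrated with a minimal Python sketch. This is not the paper's implementation: it skips the hierarchical Row Tree, and the function names (`count_tokens`, `row_to_fields`, `split_row`, `greedy_merge`, `chunk_csv`) as well as the whitespace-based token count are assumptions made for illustration only.

```python
import csv
import io


def count_tokens(text):
    # Assumption: whitespace-separated words stand in for a real tokenizer.
    return len(text.split())


def row_to_fields(row):
    # Encode each cell of a row as a "column: value" line (key-value block).
    return [f"{key}: {value}" for key, value in row.items()]


def split_row(fields, max_tokens):
    # Split a row only at field boundaries so no key-value pair is cut.
    # A single field longer than max_tokens is kept whole in this sketch.
    blocks, current, used = [], [], 0
    for field in fields:
        n = count_tokens(field)
        if current and used + n > max_tokens:
            blocks.append("\n".join(current))
            current, used = [], 0
        current.append(field)
        used += n
    if current:
        blocks.append("\n".join(current))
    return blocks


def greedy_merge(blocks, max_tokens):
    # Pack consecutive blocks into a chunk until the budget would be exceeded.
    # Each block lands in exactly one chunk, so chunks never overlap.
    chunks, current, used = [], [], 0
    for block in blocks:
        n = count_tokens(block)
        if current and used + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(block)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def chunk_csv(text, max_tokens):
    # Read CSV text, encode rows as key-value blocks, then split and merge.
    reader = csv.DictReader(io.StringIO(text))
    blocks = []
    for row in reader:
        blocks.extend(split_row(row_to_fields(row), max_tokens))
    return greedy_merge(blocks, max_tokens)


if __name__ == "__main__":
    sample = "id,name,city\n1,Ada,London\n2,Bob,Paris\n3,Cy,Rome\n"
    for chunk in chunk_csv(sample, max_tokens=12):
        print(chunk)
        print("---")
```

With a budget large enough for two rows, the first two rows share one chunk and the third gets its own, which shows how greedy packing can lower chunk counts compared with emitting one chunk per row.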

Abstract

Tabular documents such as CSV and Excel files are widely used in enterprise data pipelines, yet existing chunking strategies for retrieval-augmented generation (RAG) are primarily designed for unstructured text and do not account for tabular structure. We propose a structure-aware tabular chunking (STC) framework that operates on row-level units by constructing a hierarchical Row Tree representation, where each row is encoded as a key-value block. STC performs token-constrained splitting aligned with structural boundaries and applies overlap-free greedy merging to produce dense, non-overlapping chunks. This design preserves semantic relationships between fields within a row while improving token utilization and reducing fragmentation. Across evaluations on the MAUD dataset, STC reduces chunk count by up to 40% and 56% compared to standard recursive and key-value based baselines, respectively, while improving token utilization and processing efficiency. In retrieval benchmarks, STC improves MRR from 0.3576 to 0.5945 in a hybrid setting and increases Recall@1 from 0.366 to 0.754 in BM25-only retrieval. These results demonstrate that preserving structure during chunking improves retrieval performance, highlighting the importance of structure-aware chunking for RAG over tabular data.
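
For readers unfamiliar with the two retrieval metrics quoted above, the following Python sketch shows how MRR and Recall@k are commonly computed. It assumes exactly one relevant chunk per query, which may not match the paper's evaluation protocol; the function names and the toy data are hypothetical.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    # 1 / rank of the relevant item, or 0.0 if it was not retrieved.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results, gold):
    # Average the reciprocal rank over all queries.
    total = sum(reciprocal_rank(r, g) for r, g in zip(results, gold))
    return total / len(gold)


def recall_at_k(results, gold, k=1):
    # Fraction of queries whose relevant item appears in the top k results.
    hits = sum(1 for r, g in zip(results, gold) if g in r[:k])
    return hits / len(gold)


if __name__ == "__main__":
    # Toy data: ranked chunk ids per query, and the single relevant id each.
    results = [["a", "b"], ["b", "a"], ["c", "d"]]
    gold = ["a", "a", "x"]
    print(mean_reciprocal_rank(results, gold))
    print(recall_at_k(results, gold, k=1))
```

Recall@1 only rewards a hit in the top position, while MRR gives partial credit for lower ranks, so the two numbers can move differently across retrieval settings.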