Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

arXiv cs.CL / 3/27/2026


Key Points

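  • The paper starts from the premise that one-size-fits-all chunking is insufficient, given how strongly document chunking determines whether RAG succeeds, and proposes Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document.
  • Five intrinsic, document-based metrics, References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), assess chunking quality directly, so strategies can be compared independently of downstream performance.
  • To support the framework, the paper also introduces two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, along with targeted post-processing techniques.
  • On a multi-domain corpus covering legal, technical, and social science texts, the authors report a significant improvement in downstream RAG performance without changing models or prompts: answer correctness rises to 72% (from 62-64%), and the number of successfully answered questions grows by over 30% (65 vs. 49).
  • The code is publicly available, which offers a practical route to building document-aware chunking selection into existing RAG pipelines.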
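
The per-document selection idea in the points above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the two baseline chunkers, the two toy scoring functions, the `select_chunker` helper, and the unweighted mean used to combine scores are all hypothetical stand-ins. The paper's actual metric definitions and selection rule are not given in this summary.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical type aliases for this sketch.
Chunker = Callable[[str], List[str]]
Metric = Callable[[str, List[str]], float]


def fixed_size_chunker(text: str, size: int = 200) -> List[str]:
    """Baseline: cut the text every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunker(text: str) -> List[str]:
    """Baseline: split on blank lines, dropping empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def size_score(text: str, chunks: List[str], lo: int = 50, hi: int = 400) -> float:
    """Toy proxy (assumption): share of chunks whose length lies in [lo, hi]."""
    if not chunks:
        return 0.0
    return sum(lo <= len(c) <= hi for c in chunks) / len(chunks)


def boundary_score(text: str, chunks: List[str]) -> float:
    """Toy proxy (assumption): share of chunks ending on sentence-final punctuation."""
    if not chunks:
        return 0.0
    return sum(c.rstrip().endswith((".", "!", "?")) for c in chunks) / len(chunks)


def select_chunker(
    text: str,
    chunkers: Dict[str, Chunker],
    metrics: List[Metric],
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate chunker on one document and return the best one.

    Scores are combined with an unweighted mean, which is an assumption
    made for this sketch.
    """
    scores: Dict[str, float] = {}
    for name, chunker in chunkers.items():
        chunks = chunker(text)
        scores[name] = mean(m(text, chunks) for m in metrics)
    best = max(scores, key=scores.get)
    return best, scores
```

Calling `select_chunker(document, {"fixed": fixed_size_chunker, "paragraph": paragraph_chunker}, [size_score, boundary_score])` once per document returns the winning chunker's name together with all scores, so different documents in the same corpus can end up with different chunkers.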

Abstract

The effectiveness of Retrieval-Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used "one-size-fits-all" approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on a set of five novel intrinsic, document-based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), which directly assess chunking quality across key dimensions. To support this framework, we also introduce two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, alongside targeted post-processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric-guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework improves RAG outcomes, raising answer correctness to 72% (from 62-64%) and increasing the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document-aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. Code available at https://github.com/ekimetrics/adaptive-chunking.
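
As a quick check on the reported figures, the "over 30%" claim follows from the two question counts, and the correctness gain can be read as a range in percentage points. The snippet below only redoes that arithmetic from the numbers in the abstract; the helper name `relative_increase` is introduced here for illustration.

```python
def relative_increase(new: float, old: float) -> float:
    """Relative change from `old` to `new`, as a fraction of `old`."""
    return (new - old) / old


# Successfully answered questions: 65 with adaptive chunking vs. 49 before.
answered_gain = relative_increase(65, 49)  # 16 / 49

# Answer correctness: 72% vs. a 62-64% range, in percentage points.
correctness_gain_min = 72 - 64
correctness_gain_max = 72 - 62

print(f"Answered questions: +{answered_gain:.1%}")  # prints "Answered questions: +32.7%"
print(f"Correctness: +{correctness_gain_min} to +{correctness_gain_max} points")
```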