Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

arXiv cs.CV / 4/14/2026

Key Points

  • The paper addresses Visual Document Retrieval (VDR) by targeting the high storage and compute costs of multi-vector models while maintaining fine-grained matching quality.
  • It introduces ColChunk, a plug-and-play framework that performs multimodal late chunking using hierarchical clustering on patch-level embeddings plus a 2D positional prior for spatial-semantic coherence.
  • ColChunk adaptively groups visual content to create contextualized multi-vectors that preserve global context while significantly reducing the number of stored vectors.
  • Experiments on 24 VDR datasets show that ColChunk cuts storage by over 90% while improving nDCG@5 by an average of 9 points across representative single-vector models.
  • The authors position ColChunk as a practical approach to balance retrieval accuracy and efficiency for deployable visual document systems.
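
The grouping step described in the key points can be pictured with a small sketch. This is not the authors' implementation: the summary does not say how the 2D position prior is fused with the embeddings, so the weighted sum of cosine distance and grid distance below, the `pos_weight` knob, the average-linkage rule, the mean pooling, and the function name `late_chunk` are all assumptions made for illustration. The input is assumed to be patch embeddings that the encoder has already contextualized over the whole page, which is what makes the chunking "late".

```python
import numpy as np

def late_chunk(patch_emb, grid_hw, n_chunks, pos_weight=0.5):
    """Pool patch embeddings into n_chunks vectors (illustrative sketch only).

    patch_emb:  (H*W, D) contextualized patch embeddings, row-major over the grid.
    grid_hw:    (H, W) shape of the patch grid.
    pos_weight: hypothetical weight of the 2D position prior (an assumption).
    """
    h, w = grid_hw
    n = h * w
    assert patch_emb.shape[0] == n and 1 <= n_chunks <= n

    # Semantic distance: cosine distance between L2-normalized embeddings.
    emb = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    sem = 1.0 - emb @ emb.T

    # Spatial distance: Euclidean distance between grid positions scaled to [0, 1].
    rows, cols = np.divmod(np.arange(n), w)
    pos = np.stack([rows / max(h - 1, 1), cols / max(w - 1, 1)], axis=1)
    spa = np.linalg.norm(pos[:, None] - pos[None], axis=-1)

    # Assumed fusion: weighted sum of the two distances.
    d = sem + pos_weight * spa
    np.fill_diagonal(d, np.inf)

    # Naive average-linkage agglomerative clustering (O(n^3), fine for a sketch).
    size = np.ones(n)
    labels = np.arange(n)
    for _ in range(n - n_chunks):
        i, j = np.unravel_index(np.argmin(d), d.shape)
        merged = (size[i] * d[i] + size[j] * d[j]) / (size[i] + size[j])
        d[i, :] = merged
        d[:, i] = merged
        d[i, i] = np.inf
        d[j, :] = np.inf  # retire cluster j
        d[:, j] = np.inf
        size[i] += size[j]
        labels[labels == j] = i

    # One vector per cluster: mean of member embeddings, re-normalized.
    ids = np.unique(labels)
    pooled = np.stack([emb[labels == c].mean(axis=0) for c in ids])
    pooled /= np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled, labels

# Toy usage with random stand-in embeddings for an 8x8 patch grid.
rng = np.random.default_rng(0)
patches = rng.normal(size=(8 * 8, 16))
chunks, labels = late_chunk(patches, (8, 8), n_chunks=6)
print(chunks.shape)  # (6, 16)
```

Raising `pos_weight` in this sketch favors spatially compact groups, while lowering it lets semantically similar but distant patches merge. How the paper actually balances the two is not stated in this summary.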

Abstract

Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose ColChunk, a plug-and-play framework that introduces multimodal late chunking to construct efficient, contextualized multi-vectors. Unlike existing pruning or fixed-token approaches, ColChunk employs hierarchical clustering on patch-level embeddings, fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping allows for a content-aware representation that preserves global context while drastically reducing the vector count. Evaluations across 24 VDR datasets demonstrate ColChunk achieves over a 90% reduction in storage requirements while simultaneously delivering a 9-point average improvement in nDCG@5 across representative single-vector models. ColChunk provides a practical solution for balancing retrieval accuracy and efficiency in visual document systems.
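
To make the storage versus matching trade-off concrete, the sketch below scores a query against a document's stored vectors and computes the fraction of vectors saved by chunking. It assumes ColBERT-style MaxSim late interaction, which is the usual scoring rule for multi-vector retrievers but is not confirmed by the abstract. The function names and the 1024-to-64 vector counts are hypothetical and are not figures from the paper.

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Assumed ColBERT-style late interaction: for each query vector take the
    best-matching document vector, then sum those maxima."""
    sim = query_vecs @ doc_vecs.T          # (n_query, n_doc) similarity matrix
    return float(sim.max(axis=1).sum())

def storage_reduction(n_before, n_after):
    """Fraction of stored vectors saved when n_before vectors shrink to n_after."""
    return 1.0 - n_after / n_before

# Toy example: two query vectors, two stored document vectors.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc = np.array([[1.0, 0.0], [0.6, 0.8]])
score = maxsim_score(query, doc)

# Hypothetical counts: 1024 patch vectors pooled into 64 chunk vectors.
saved = storage_reduction(1024, 64)
print(saved)  # 0.9375
```

Because the score depends only on the stored vectors, swapping a page's patch vectors for a smaller set of chunk vectors leaves the scoring code unchanged in this sketch, and both index size and per-query similarity computations scale with the number of vectors kept.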