Bag of Bags: Adaptive Visual Vocabularies for Genizah Join Image Retrieval

arXiv cs.CV / 4/10/2026


Key Points

  • The paper introduces Bag of Bags (BoB), a new image-retrieval method for identifying manuscript “joins” by retrieving other fragments from the same physical manuscript using fragment-specific visual vocabularies rather than a single global Bag-of-Words codebook.
  • BoB is trained via a sparse convolutional autoencoder on binarized fragment patches, then encodes connected components per page, clusters embeddings with per-image k-means, and compares fragments using set-to-set distances between their local vocabularies.
  • On Cairo Genizah fragment data, the best BoB variant using Chamfer distance improves retrieval performance to Hit@1 = 0.78 and MRR = 0.84 versus the strongest classical BoW baseline at Hit@1 = 0.74 and MRR = 0.80 (a 6.1% relative top-1 gain).
  • The authors also propose a mass-weighted BoB-OT variant that incorporates cluster population into prototype matching, with a theoretical approximation guarantee relative to full component-level optimal transport.
  • For scalability, the paper evaluates a two-stage approach (BoW shortlist followed by BoB-OT reranking) to balance retrieval accuracy and computational cost for larger manuscript collections.
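The set-to-set comparison at the heart of BoB can be illustrated with a symmetric Chamfer distance between two fragments' prototype sets (the per-image k-means centroids). This is a minimal NumPy sketch under our own assumptions about the representation; the function name and the toy vectors are illustrative, not taken from the paper:

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between two sets of prototype vectors.

    A, B : arrays of shape (n_a, d) and (n_b, d), e.g. the k-means
           centroids forming each fragment's local visual vocabulary.
    For each prototype in A, take the Euclidean distance to its nearest
    neighbour in B (and vice versa), then average the two directed terms.
    """
    # Pairwise distance matrix, shape (n_a, n_b), via broadcasting.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

# Toy vocabularies: identical sets score 0; shifted sets score > 0.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 1.0], [1.0, 1.0]])
print(chamfer_distance(A, A))  # 0.0
print(chamfer_distance(A, B))  # 1.0
```

At query time, fragments would be ranked by ascending Chamfer distance to the query's vocabulary; the actual embedding dimension and cluster count in the paper are not reproduced here.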

Abstract

A join is a set of manuscript fragments identified as originally emanating from the same manuscript. We study manuscript join retrieval: given a query image of a fragment, retrieve other fragments originating from the same physical manuscript. We propose Bag of Bags (BoB), an image-level representation that replaces the single global visual codebook of classical Bag of Words (BoW) with a fragment-specific vocabulary of local visual words. Our pipeline trains a sparse convolutional autoencoder on binarized fragment patches, encodes connected components from each page, clusters the resulting embeddings with per-image k-means, and compares images using set-to-set distances between their local vocabularies. Evaluated on fragments from the Cairo Genizah, the best BoB variant (viz. Chamfer) achieves Hit@1 of 0.78 and MRR of 0.84, compared to 0.74 and 0.80, respectively, for the strongest BoW baseline (BoW-RawPatches-χ²), a 6.1% relative improvement in top-1 accuracy. We furthermore study a mass-weighted BoB-OT variant that incorporates cluster population into prototype matching and present a formal approximation guarantee bounding its deviation from full component-level optimal transport. A two-stage pipeline using a BoW shortlist followed by BoB-OT reranking provides a practical compromise between retrieval strength and computational cost, supporting applicability to larger manuscript collections.
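The mass-weighted matching behind BoB-OT can be approximated with entropy-regularized optimal transport (Sinkhorn iterations) between two weighted prototype sets. The sketch below is our own toy implementation, not the paper's algorithm; the regularization strength, iteration count, and all names are illustrative assumptions:

```python
import numpy as np

def sinkhorn_ot(a, b, C, reg=0.05, n_iter=500):
    """Entropy-regularized OT cost between two weighted prototype sets.

    a, b : mass vectors (non-negative, summing to 1), e.g. the fraction
           of connected components assigned to each cluster prototype.
    C    : pairwise cost matrix between the two fragments' prototypes.
    Runs standard Sinkhorn scaling and returns the transport cost
    <P, C> under the resulting (approximate) transport plan P.
    """
    K = np.exp(-C / reg)            # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)           # scale to match column marginals b
        u = a / (K @ v)             # scale to match row marginals a
    P = u[:, None] * K * v[None, :] # transport plan
    return float((P * C).sum())

# Toy example: two clusters per fragment, uniform masses, and a cost
# matrix whose cheap entries lie on the diagonal -> near-zero cost.
a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(sinkhorn_ot(a, b, C))  # close to 0
```

Weighting prototypes by cluster population lets a dominant letterform in one fragment absorb more transport mass than a rare one, which is the intuition the paper's approximation guarantee makes precise relative to full component-level OT.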