MURE: Hierarchical Multi-Resolution Encoding via Vision-Language Models for Visual Document Retrieval
arXiv cs.CV / 3/17/2026
Key Points
- Visual Document Retrieval requires representations that capture both fine-grained visual details and global document structure, but existing models either lose fine details or incur high indexing costs and retrieval latency.
- The authors introduce the X-VisEmb paradigm featuring multi-resolution sampling and encoding, cross-granularity feature fusion, and adaptive representation distillation to fuse cues across scales.
- Building on X-VisEmb, MURE uses vision-language models as a hierarchical multi-resolution encoder, introduces Matryoshka-style resolution-level representation learning for effective feature fusion, and applies semantic-aware hierarchical clustering to compress visual tokens.
- Experiments on two VDR benchmarks show that MURE consistently outperforms strong baselines, surpassing ColPali while using only 50% of its visual token budget and thereby reducing indexing overhead and retrieval latency.
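To make the Matryoshka-style idea concrete, here is a minimal sketch of nested-prefix contrastive training: one full embedding is produced per query/document, and a contrastive loss is averaged over nested prefixes so that shorter (coarser-resolution) prefixes remain useful retrieval representations on their own. The prefix sizes, the symmetric InfoNCE loss, and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Unit-normalize so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce_loss(q, d, temperature=0.05):
    """Batch contrastive loss: matching (q_i, d_i) pairs are positives,
    all other in-batch documents serve as negatives."""
    q, d = l2_normalize(q), l2_normalize(d)
    logits = q @ d.T / temperature                   # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # cross-entropy on the diagonal

def matryoshka_loss(q_emb, d_emb, prefix_dims=(64, 128, 256)):
    """Average the contrastive loss over nested embedding prefixes,
    so each truncated prefix is trained to be a usable embedding."""
    losses = [info_nce_loss(q_emb[:, :k], d_emb[:, :k]) for k in prefix_dims]
    return float(np.mean(losses))

# Toy usage: aligned query/document pairs should score a lower loss
# than randomly paired embeddings.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 256))
d = q + 0.1 * rng.standard_normal((8, 256))   # loosely aligned positives
loss_aligned = matryoshka_loss(q, d)
loss_random = matryoshka_loss(q, rng.standard_normal((8, 256)))
```

At retrieval time, the same trained model can then be indexed at a cheaper truncated dimension (e.g. 64) while retaining most of the full-dimension accuracy, which is the usual motivation for Matryoshka-style objectives.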
