MURE: Hierarchical Multi-Resolution Encoding via Vision-Language Models for Visual Document Retrieval
arXiv cs.CV / 3/17/2026
Key Points
- Visual Document Retrieval (VDR) requires representations that capture both fine-grained visual detail and global document structure, but existing models either lose local detail or incur high indexing cost and retrieval latency.
- The authors introduce the X-VisEmb paradigm featuring multi-resolution sampling and encoding, cross-granularity feature fusion, and adaptive representation distillation to fuse cues across scales.
- Building on X-VisEmb, MURE uses vision-language models as a hierarchical multi-resolution encoder, introduces Matryoshka-style resolution-level representation learning for effective feature fusion, and applies semantic-aware hierarchical clustering to compress visual tokens.
- Experiments on two VDR benchmarks show MURE consistently beats strong baselines and outperforms ColPali with only 50% of its visual token budget, reducing indexing overhead and retrieval latency.
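The token-compression step above merges semantically similar visual tokens to shrink the index. A minimal sketch of that idea, assuming agglomerative clustering on cosine distance with mean-pooled merges (the paper's exact clustering criterion and merge rule may differ; `compress_tokens` and its parameters are illustrative, not the authors' API):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def compress_tokens(tokens: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Merge similar visual tokens via hierarchical clustering.

    tokens: (n, d) array of patch-token embeddings.
    keep_ratio: fraction of the token budget to keep (0.5 halves it).
    """
    n, _ = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    # Agglomerative clustering on pairwise cosine distance between tokens.
    dists = pdist(tokens, metric="cosine")
    tree = linkage(dists, method="average")
    # Cut the dendrogram into at most n_keep clusters.
    labels = fcluster(tree, t=n_keep, criterion="maxclust")
    # Mean-pool each cluster into a single merged token.
    return np.stack([tokens[labels == c].mean(axis=0)
                     for c in np.unique(labels)])

# Example: 16 random 8-dim tokens compressed to at most 8 merged tokens.
rng = np.random.default_rng(0)
toks = rng.normal(size=(16, 8))
out = compress_tokens(toks, keep_ratio=0.5)
print(out.shape)
```

Under this sketch, a 50% `keep_ratio` matches the reported setting of beating ColPali with half the visual token budget: fewer stored tokens directly cuts index size and late-interaction scoring cost at query time.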