LITTA: Late-Interaction and Test-Time Alignment for Visually-Grounded Multimodal Retrieval
arXiv cs.AI · March 31, 2026
Key Points
- LITTA is a test-time framework centered on query expansion for multimodal evidence-page retrieval from visually complex documents such as textbooks and manuals, where long context and weak lexical overlap make retrieval difficult.
- It uses a large language model to generate complementary query variants, then retrieves candidate pages with a frozen vision retriever using late-interaction scoring.
- Candidate lists from expanded queries are combined via reciprocal rank fusion to improve coverage and reduce dependence on any single query phrasing.
- Experiments on three domains (computer science, pharmaceuticals, industrial manuals) show that multi-query retrieval improves top-k accuracy, recall, and MRR versus single-query retrieval, especially where visual and semantic variability is high.
- LITTA also offers a controllable accuracy–latency trade-off by adjusting the number of query variants, and it remains compatible with existing multimodal embedding indices without retriever retraining.
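Late-interaction scoring, as the summary describes it, compares every query token embedding against every page patch embedding rather than collapsing each side into a single vector. A minimal sketch of the standard ColBERT-style MaxSim formulation (the exact scoring function LITTA uses is an assumption here, as is every name below):

```python
import numpy as np

def late_interaction_score(query_embs: np.ndarray, page_embs: np.ndarray) -> float:
    """MaxSim: for each query token embedding, take the maximum cosine
    similarity over all page patch embeddings, then sum over query tokens."""
    # Normalize rows so dot products become cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    sims = q @ p.T                        # (n_query_tokens, n_page_patches)
    return float(sims.max(axis=1).sum())  # best match per query token, summed

def rank_pages(query_embs: np.ndarray, pages: dict) -> list:
    """Rank candidate pages (id -> patch-embedding matrix) by MaxSim score."""
    scored = [(pid, late_interaction_score(query_embs, embs))
              for pid, embs in pages.items()]
    return sorted(scored, key=lambda t: -t[1])
```

Because scoring only reads precomputed patch embeddings, this is compatible with a frozen retriever and an existing embedding index, which matches the no-retraining claim above.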
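Reciprocal rank fusion, used above to combine the candidate lists from the expanded queries, scores each page by summing 1/(k + rank) over every list it appears in; a page ranked moderately well by several query variants can outrank one ranked first by only a single variant. A minimal sketch with the conventional k = 60:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked candidate lists into one.

    Each page accumulates 1 / (k + rank) from every list containing it
    (ranks start at 1); higher fused score means earlier in the output.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing `[["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]]` puts `b` first, since two of the three query variants rank it at the top.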