Towards Long-horizon Agentic Multimodal Search
arXiv cs.CV · April 15, 2026
📰 NewsSignals & Early TrendsIdeas & Deep AnalysisTools & Practical UsageModels & Research
Key Points
- The paper proposes LMM-Searcher, a long-horizon multimodal deep-search framework that curbs multimodal context explosion and token cost by storing visual assets as files and referencing them in context with lightweight UIDs (see the sketch after this list).
- It introduces a tailored fetch-image tool that loads visual content on demand during active perception, enabling progressive, memory-efficient multimodal retrieval across many turns.
- The authors build a data synthesis pipeline that creates queries requiring complex cross-modal multi-hop reasoning, then distill 12K trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized long-horizon search agent.
- Experiments on four benchmarks show the approach scales to 100-turn search horizons and achieves state-of-the-art results among open-source models on tasks like MM-BrowseComp and MMSearch-Plus, with good generalizability across base models.
- The authors state that the code will be released publicly at the linked GitHub repository.
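
To make the mechanism in the first two points concrete, here is a minimal Python sketch of UID-based file storage for visual assets together with an on-demand fetch-image tool. The function names, directory layout, and UID scheme are illustrative assumptions on my part; the paper's actual interface has not been released.

```python
import uuid
from pathlib import Path

# Hypothetical asset store: images encountered during search are written to
# disk, and only a short UID string is kept in the agent's text context.
ASSET_DIR = Path("./visual_assets")
ASSET_DIR.mkdir(exist_ok=True)


def store_image(image_bytes: bytes) -> str:
    """Persist raw image bytes to a file and return a lightweight UID
    that the agent can cite in its context instead of the pixels."""
    uid = uuid.uuid4().hex[:8]
    (ASSET_DIR / f"{uid}.png").write_bytes(image_bytes)
    return uid


def fetch_image(uid: str) -> bytes:
    """On-demand fetch-image tool (illustrative): load a stored asset back
    into the model's multimodal input only when the agent chooses to inspect it."""
    path = ASSET_DIR / f"{uid}.png"
    if not path.exists():
        raise FileNotFoundError(f"No stored asset for UID {uid}")
    return path.read_bytes()


if __name__ == "__main__":
    # A search turn downloads an image, stores it, and keeps only the UID.
    uid = store_image(b"\x89PNG...")  # placeholder bytes for the example
    print(f"Context reference: <image uid={uid}>")
    # Many turns later, the agent calls the tool to actually look at it.
    pixels = fetch_image(uid)
    print(f"Loaded {len(pixels)} bytes for active perception")
```

Keeping only UIDs in context is what lets the search horizon stretch to many turns without the accumulated images blowing up the token budget; the image bytes enter the model only at the moment they are fetched.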