MTA-Agent: An Open Recipe for Multimodal Deep Search Agents
arXiv cs.CV / 4/9/2026
Key Points
- The paper introduces MTA-Agent, a multimodal deep-search agent that automatically selects tools and their parameters to retrieve and verify evidence across both images and text, then synthesizes evidence-grounded answers.
- It builds a verified multi-hop vision-language training dataset, MTA-Vision-DeepSearch, with 21K high-quality examples generated from VQA seeds and filtered via multi-stage checks for factual consistency and answer uniqueness.
- Using this data, a 32B open-source multimodal search agent reportedly reaches a 54.63% average score across six benchmarks under identical tool settings, outperforming GPT-5 (51.86%) and Gemini variants.
- The authors find training on their dataset increases reasoning depth and improves tool-use behavior, raising average search steps from 2.27 to 4.28 and producing more systematic search strategies.
- They also propose a cost-saving training approach that replays cached tool interactions instead of making real-time calls, and they release the full dataset and implementation details for reproducibility.
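The replay-based training idea in the last point can be sketched as a simple cache layer: tool interactions observed during live data collection are stored, and training-time rollouts read the cached responses instead of issuing real API calls. The paper does not specify the implementation, so the class and method names below (`ToolReplayCache`, `record`, `replay`) are purely illustrative assumptions, not the authors' code.

```python
import hashlib
import json


class ToolReplayCache:
    """Hypothetical sketch: cache tool interactions during live collection,
    then replay them during training to avoid real-time tool-call costs."""

    def __init__(self):
        self._cache = {}

    def _key(self, tool, params):
        # Deterministic key from tool name plus sorted parameters.
        blob = json.dumps({"tool": tool, "params": params}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def record(self, tool, params, response):
        # Live phase: store the real tool response once.
        self._cache[self._key(tool, params)] = response

    def replay(self, tool, params):
        # Training phase: serve the cached response; fail loudly if missing.
        key = self._key(tool, params)
        if key not in self._cache:
            raise KeyError(f"no cached interaction for {tool}({params})")
        return self._cache[key]


# Live phase: one real search call is recorded.
cache = ToolReplayCache()
cache.record("image_search", {"query": "Eiffel Tower height"}, "330 m")

# Training phase: the same call is answered from the cache, with no API cost.
print(cache.replay("image_search", {"query": "Eiffel Tower height"}))  # → 330 m
```

Keying on the full (tool, parameters) pair means a replayed rollout only matches interactions the agent actually made during collection, which is consistent with the paper's claim that cached replays can substitute for real-time calls during training.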



