MTA-Agent: An Open Recipe for Multimodal Deep Search Agents

arXiv cs.CV / 4/9/2026


Key Points

  • The paper introduces MTA-Agent, a multimodal deep-search agent that automatically selects tools and parameters to retrieve and validate evidence from both images and text for evidence-based QA synthesis.
  • It builds a verified multi-hop vision-language training dataset, MTA-Vision-DeepSearch, with 21K high-quality examples generated from VQA seeds and filtered via multi-stage checks for factual consistency and answer uniqueness.
  • Using this data, a 32B open-source multimodal search agent reportedly reaches an average of 54.63% across six benchmarks under the same tool settings, outperforming GPT-5 (51.86%) and Gemini variants.
  • The authors find training on their dataset increases reasoning depth and improves tool-use behavior, raising average search steps from 2.27 to 4.28 and producing more systematic search strategies.
  • They also propose a cost-saving training approach that replays cached tool interactions instead of making real-time calls, and they release the full dataset and implementation details for reproducibility.
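The multi-stage verification mentioned in the second point can be pictured as a chain of filters over candidate multi-hop QA examples. The check functions below are hypothetical stand-ins for the paper's actual criteria, which are not detailed here; they only illustrate the shape of a "keep the example only if every stage passes" pipeline.

```python
def passes_factual_consistency(example):
    # Hypothetical stage: every evidence snippet the example claims
    # must actually appear in its attached retrieved sources.
    return all(ev in example["sources"] for ev in example["evidence"])

def passes_answer_uniqueness(example):
    # Hypothetical stage: all candidate answers must agree,
    # i.e. the question admits exactly one answer.
    return len(set(example["candidate_answers"])) == 1

def filter_examples(examples):
    """Keep only examples that clear every verification stage."""
    stages = [passes_factual_consistency, passes_answer_uniqueness]
    return [ex for ex in examples if all(stage(ex) for stage in stages)]

# Minimal usage: the first example passes both stages, the second fails
# uniqueness because its candidate answers disagree.
good = {"sources": ["A", "B"], "evidence": ["A"], "candidate_answers": ["x", "x"]}
bad = {"sources": ["A"], "evidence": ["A"], "candidate_answers": ["x", "y"]}
print(len(filter_examples([good, bad])))  # → 1
```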

Abstract

Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep search and the integration of visual evidence with external knowledge. In this work, we address this challenge by constructing high-quality, verified multi-hop vision-language training data for multimodal deep-search agents. We propose a Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis (MTA-Agent), which automatically selects tools and their parameters to retrieve and validate evidence from both visual and textual sources and generates structured multi-hop question-answer trajectories. Starting from diverse VQA seed datasets, our pipeline produces a large-scale training dataset, MTA-Vision-DeepSearch, containing 21K high-quality multi-hop examples. The data is filtered through a multi-stage verification process to ensure factual consistency and answer uniqueness. Using MTA-Vision-DeepSearch, a 32B open-source multimodal search agent achieves state-of-the-art performance, reaching an average of 54.63% across six challenging benchmarks and outperforming GPT-5 (51.86%), Gemini-2.5-Pro (50.98%), and Gemini-3-Pro (54.46%) under the same tool settings. We further show that training on our data improves both reasoning depth and tool-use behavior, increasing the average number of steps from 2.27 to 4.28 and leading to more systematic and persistent search strategies. Additionally, we demonstrate that training can be performed without real-time tool calls by replaying cached interactions, significantly reducing training cost. Importantly, we present MTA-Agent as a fully open recipe for multimodal deep search: we release the entire dataset, training trajectories, and implementation details to enable reproducibility and future research on open multimodal search agents.
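The cached-replay training idea in the abstract can be sketched as a simple lookup layer: during data generation each tool call's result is recorded once, and at training time the agent's tool calls are answered from the cache instead of from live search or API services. This is a minimal sketch under assumed interfaces; the class and method names are hypothetical, not the paper's implementation.

```python
import hashlib
import json

class ToolReplayCache:
    """Record tool-call results once, then replay them during training
    so no real-time tool calls are needed (hypothetical sketch)."""

    def __init__(self):
        self._cache = {}

    @staticmethod
    def _key(tool_name, params):
        # Canonicalize the call so identical calls map to the same key.
        blob = json.dumps({"tool": tool_name, "params": params}, sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def record(self, tool_name, params, result):
        # Data-generation time: store the live tool's result.
        self._cache[self._key(tool_name, params)] = result

    def replay(self, tool_name, params):
        # Training time: serve the cached result; a miss means the
        # trajectory contains a call never seen during generation.
        key = self._key(tool_name, params)
        if key not in self._cache:
            raise KeyError(f"uncached tool call: {tool_name}")
        return self._cache[key]

# Generation phase: execute the real tool once and record its output.
cache = ToolReplayCache()
cache.record("web_search", {"query": "Eiffel Tower height"}, "about 330 m")

# Training phase: the identical call is served from the cache.
print(cache.replay("web_search", {"query": "Eiffel Tower height"}))  # → about 330 m
```

Keying on a canonical JSON serialization of the call means parameter order does not matter, so the same logical call always hits the same cache entry.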