VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents
arXiv cs.CV / 3/18/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- VisBrowse-Bench introduces a new benchmark for visual-native search in multimodal browsing agents to evaluate visual reasoning during the search process.
- The benchmark comprises 169 VQA instances spanning multiple domains and uses multimodal evidence cross-validation, combining text-image retrieval with joint reasoning (see the first sketch below).
- Data were constructed by human experts via a multi-stage pipeline and underwent rigorous manual verification to ensure reliability.
- The authors propose an agent workflow in which the browsing agent actively collects and reasons over visual information during search, guiding its browsing decisions (see the second sketch below).
- Evaluation of both open-source and closed-source models reveals clear performance gaps (e.g., Claude-4.6-Opus at 47.6% accuracy; o3-deep-research at 41.1%), underscoring the open challenges of visual-native multimodal search. Code and data are released on GitHub.
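
To make the evidence cross-validation in the second key point concrete, here is a minimal sketch. It assumes a CLIP-style dual encoder behind hypothetical `embed_text` / `embed_image` functions and an arbitrary acceptance threshold; none of these names or values come from the paper.

```python
# Minimal sketch of multimodal evidence cross-validation: an answer is
# accepted only if textual AND visual evidence independently support it.
# embed_text/embed_image are placeholders (assumed CLIP-style encoders),
# not the benchmark's actual implementation.
from typing import List
import numpy as np

def embed_text(text: str) -> np.ndarray:
    # Placeholder: deterministic random unit vector standing in for a
    # real text encoder that shares an embedding space with images.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def embed_image(image_path: str) -> np.ndarray:
    # Placeholder: a real encoder would load and embed the image here.
    rng = np.random.default_rng(abs(hash(image_path)) % (2**32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def cross_validate(answer: str,
                   text_evidence: List[str],
                   image_evidence: List[str],
                   threshold: float = 0.25) -> bool:
    # Each modality must clear the threshold on its own, so a claim
    # backed by text alone (or an image alone) is rejected.
    a = embed_text(answer)
    text_score = max((float(a @ embed_text(t)) for t in text_evidence), default=0.0)
    image_score = max((float(a @ embed_image(p)) for p in image_evidence), default=0.0)
    return text_score >= threshold and image_score >= threshold

# With the placeholder encoders the scores are meaningless noise; swap in
# real encoders to get informative results.
print(cross_validate("Eiffel Tower", ["The iron tower in Paris ..."], ["poster.png"]))
```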
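
Similarly, here is a minimal sketch of the active visual-collection workflow from the fourth key point, assuming a search → screenshot → reason loop; `web_search`, `screenshot`, and `vlm_answer` are hypothetical stand-ins for the authors' actual tools.

```python
# Minimal sketch of a visual-native browsing loop: search, collect
# screenshots as visual evidence, then let a VLM reason jointly over
# the accumulated context. All tool names are hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BrowseState:
    question: str
    images: List[str] = field(default_factory=list)  # collected screenshot paths
    notes: List[str] = field(default_factory=list)   # collected text snippets

def web_search(query: str) -> List[str]:
    # Placeholder search tool returning candidate page URLs.
    return [f"https://example.com/page{i}" for i in range(3)]

def screenshot(url: str) -> str:
    # Placeholder renderer: would load the page and save a screenshot.
    return f"/tmp/{url.rsplit('/', 1)[-1]}.png"

def vlm_answer(question: str, images: List[str], notes: List[str]) -> Optional[str]:
    # Placeholder VLM call: answers once enough visual evidence is
    # gathered, otherwise signals the agent to keep browsing.
    return "stub answer" if len(images) >= 2 else None

def browse(question: str, max_steps: int = 5) -> Optional[str]:
    state = BrowseState(question)
    for _ in range(max_steps):
        # 1. Actively collect visual evidence for the current question.
        for url in web_search(state.question):
            state.images.append(screenshot(url))
        # 2. Reason jointly over everything collected so far.
        answer = vlm_answer(state.question, state.images, state.notes)
        if answer is not None:
            return answer
    return None  # evidence budget exhausted without a confident answer

print(browse("Which landmark appears on the festival poster?"))
```

The point the benchmark probes is step 1: the agent itself decides which visual evidence to fetch during search, rather than reasoning only over retrieved text.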