ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

arXiv cs.CV / 4/23/2026



Key Points

The paper introduces ProMMSearchAgent, a multimodal search agent trained with process-oriented rewards to tackle sparse supervision and the unpredictability of live web environments.
It uses a Sim-to-Real training setup that moves policy learning into a deterministic, local static sandbox, avoiding the unpredictability of training directly on the live web.
The approach adds an introspective, process-oriented reward: by probing the limits of the agent's own parametric knowledge, it produces dense guidance that rewards the correct cognitive decision, namely initiating a multimodal or text search only when the agent is visually or factually uncertain.
Experiments show zero-shot transfer to the live Google Search API and report new state-of-the-art results, outperforming MMSearch-R1 across multiple benchmarks.
Reported gains include +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch, indicating strong generalization for knowledge-intensive visual reasoning.
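
The sandbox idea in the key points can be pictured as a search backend whose results are frozen ahead of time, so the same query always returns the same evidence during training. The following is a minimal sketch under that assumption, not the paper's implementation: `SearchBackend`, `StaticSandbox`, and `rollout_step` are hypothetical names, and the corpus format is invented for illustration. A live backend would implement the same `search` method, which is what would let a locally trained policy be pointed at a real search API without retraining.

```python
from typing import Protocol


class SearchBackend(Protocol):
    """Hypothetical interface shared by the local sandbox and a live backend."""

    def search(self, query: str) -> list[str]: ...


class StaticSandbox:
    """Deterministic local backend: results come from a frozen, pre-collected corpus."""

    def __init__(self, corpus: dict[str, list[str]]) -> None:
        # Normalize keys once so lookups ignore case and surrounding whitespace.
        self._corpus = {k.strip().lower(): list(v) for k, v in corpus.items()}

    def search(self, query: str) -> list[str]:
        # Unknown queries return an empty list instead of touching the network.
        return list(self._corpus.get(query.strip().lower(), []))


def rollout_step(backend: SearchBackend, query: str, top_k: int = 3) -> list[str]:
    """One search call inside a training rollout; the policy code never knows
    whether the backend is the sandbox or a live service."""
    return backend.search(query)[:top_k]


sandbox = StaticSandbox({"Eiffel Tower height": ["snippet A", "snippet B"]})
results = rollout_step(sandbox, " eiffel tower HEIGHT ")
```

Because the sandbox is a plain dictionary lookup, repeated rollouts with the same query see identical results, which is the property the Sim-to-Real setup relies on.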

Abstract

Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision, initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally-trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.
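
To make the process-oriented reward more concrete, here is a minimal sketch of how such a signal could be computed. It assumes the knowledge probe yields two flags per step (whether the model can identify the image unaided, and whether it knows the requested fact); the names `KnowledgeProbe`, `expected_action`, `process_reward`, and `total_reward`, the bonus and penalty values, and the weighting against the outcome reward are all illustrative assumptions, since the abstract does not give the actual formula.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ANSWER = "answer"              # answer directly from parametric knowledge
    IMAGE_SEARCH = "image_search"  # multimodal search
    TEXT_SEARCH = "text_search"    # text search


@dataclass(frozen=True)
class KnowledgeProbe:
    """Hypothetical result of probing the model with search tools disabled."""
    recognizes_image: bool  # can the model identify the visual entity unaided?
    knows_fact: bool        # can it answer once the entity is known?


def expected_action(probe: KnowledgeProbe) -> Action:
    """Search only when uncertain: visual uncertainty first, then factual."""
    if not probe.recognizes_image:
        return Action.IMAGE_SEARCH
    if not probe.knows_fact:
        return Action.TEXT_SEARCH
    return Action.ANSWER


def process_reward(action: Action, probe: KnowledgeProbe,
                   bonus: float = 1.0, penalty: float = -0.5) -> float:
    """Dense per-step reward (assumed values) for matching the expected action."""
    return bonus if action is expected_action(probe) else penalty


def total_reward(outcome_correct: bool,
                 steps: list[tuple[Action, KnowledgeProbe]],
                 weight: float = 0.5) -> float:
    """Sparse outcome reward plus the weighted mean of per-step process rewards."""
    outcome = 1.0 if outcome_correct else 0.0
    if not steps:
        return outcome
    mean_process = sum(process_reward(a, p) for a, p in steps) / len(steps)
    return outcome + weight * mean_process
```

In this sketch, an agent that answers directly when the probe shows it already knows the answer is rewarded, while an unnecessary search (or a skipped search under uncertainty) is penalized at the step where it happens, rather than only through the final answer.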


