Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

arXiv cs.AI / 4/6/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that current multimodal agent evaluations are inadequate because they don’t flexibly test tool use, don’t isolate visual vs. web/search tools cleanly, and often judge only final answers rather than whether tools were correctly invoked and applied.
  • It introduces Agentic-MME, a process-verified multimodal benchmark with 418 real-world tasks across 6 domains and 3 difficulty levels, including 2,000+ stepwise checkpoints validated with fine-grained intermediate-state auditing.
  • The benchmark evaluates “capability synergy” between visual expansion (using visual tools) and knowledge expansion (using open-web search) using a unified framework that supports sandboxed code and APIs plus human reference trajectories.
  • Models are scored not only on correctness (e.g., Gemini3-pro’s 56.3% overall accuracy) but also on process efficiency via an “overthinking” metric, with performance dropping to 23.0% on the hardest Level-3 tasks.
  • Overall, the results highlight that real-world multimodal agentic problem solving remains challenging and that process-level verification can expose weaknesses masked by end-answer-only metrics.
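The summary does not give the exact form of the "overthinking" metric, only that it quantifies process efficiency relative to human reference trajectories. A minimal sketch of one plausible formulation, assuming overthinking is measured as the fraction of excess steps a model takes beyond the human trajectory (the paper's actual definition may differ):

```python
def overthinking_score(model_steps: int, human_steps: int) -> float:
    """Hypothetical overthinking metric (illustration only, not the
    paper's formula): fraction of extra steps the model takes beyond
    the human reference trajectory.

    0.0 means the model used no more steps than the human annotator;
    higher values mean more redundant tool calls or reasoning steps.
    """
    if human_steps <= 0:
        raise ValueError("human reference trajectory must have at least one step")
    return max(0.0, (model_steps - human_steps) / human_steps)
```

For example, a model that takes 15 tool-use steps on a task whose human reference trajectory has 10 steps would score 0.5 under this formulation, while a model matching or beating the human step count scores 0.0.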

Abstract

Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers. Consequently, they cannot verify whether tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along two axes: the S-axis and the V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show that the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls sharply to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.