V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models

arXiv cs.CV / 4/27/2026

💬 Opinion · Models & Research

Key Points

  • The paper introduces V-MAGE, a game-based evaluation framework to assess vision-centric capabilities of multimodal large language models (MLLMs) in interactive, dynamic environments rather than static image-text benchmarks.
  • V-MAGE includes five video games and more than 30 carefully designed scenarios in visually complex, free-form settings that require decision-making from visual input over time.
  • It uses a dynamic, ELO-based ranking system to enable robust, interpretable comparisons across models while accounting for different difficulty levels and task diversity.
  • Experiments benchmark leading MLLMs against human baselines: models approach human-level performance on simple tasks but drop substantially on complex reasoning and task orchestration, revealing limitations in frame-by-frame, vision-grounded interactive control.
  • The authors provide public code and show through analyses that V-MAGE can identify concrete weaknesses and offer actionable guidance for improving MLLMs’ visual reasoning in dynamic interactions.
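The paper's ranking relies on pairwise, Elo-style comparisons rather than raw scores. The exact variant V-MAGE uses (including how it weights difficulty levels) is not spelled out in this summary, so the sketch below shows only the standard Elo update that such systems build on; the function name `elo_update` and the K-factor of 32 are illustrative assumptions, not the paper's implementation.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update after a head-to-head comparison (illustrative, not V-MAGE's exact variant).

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    Returns the updated (r_a, r_b) pair.
    """
    # Expected score of A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    # The winner gains exactly what the loser sheds (zero-sum update).
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start at the same rating; model A wins one matchup.
a, b = elo_update(1500, 1500, 1.0)
print(a, b)  # 1516.0 1484.0
```

Because ratings shift more after an upset than after an expected result, repeated pairwise matchups across tasks of varying difficulty converge to an interpretable ordering even when the individual game scores are not directly comparable.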

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visual-text processing. However, existing static image-text benchmarks are insufficient for evaluating their dynamic perception and interactive reasoning abilities. We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework designed to systematically assess MLLMs' visual reasoning in interactive, continuous-space environments. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. These scenarios are set in free-form, visually complex environments that require models to interpret dynamic game states and make decisions based solely on visual input, thereby closely reflecting the conditions encountered by human players. To ensure robust and interpretable comparisons across models, V-MAGE employs a dynamic ELO-based ranking system that accounts for varying difficulty levels and task diversity. Benchmarking state-of-the-art MLLMs against human baselines reveals that while leading models approach human-level performance in simple tasks, their performance drops significantly in complex scenarios requiring advanced reasoning and task orchestration. This persistent performance gap highlights fundamental limitations in current MLLMs' ability to perform vision-grounded, interactive frame-by-frame control in simulated continuous-time environments. Through extensive analyses, we demonstrate the utility of V-MAGE in uncovering these limitations and providing actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings. Code is publicly available at https://github.com/CSU-JPG/V-MAGE.