GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

arXiv cs.CV / 4/10/2026


Key Points

  • The paper introduces GameWorld, a new benchmark aimed at standardized and verifiable evaluation of multimodal (MLLM) agents acting as generalist game agents in browser environments.
  • It targets the challenges MLLM agents face in real-world interaction — latency, sparse feedback, and irreversible mistakes — while fixing two evaluation limitations: heterogeneous action interfaces and heuristic verification.
  • GameWorld covers 34 games and 170 tasks, and evaluates agent performance using state-verifiable, outcome-based metrics to enable reproducible comparisons.
  • The benchmark studies two agent interface types: computer-use agents emitting keyboard/mouse controls and multimodal agents using deterministic Semantic Action Parsing into a semantic action space.
  • Experiments across 18 model-interface pairs show that even top agents remain far from human-level performance, and additional tests highlight challenges in real-time interaction, context-memory sensitivity, and action validity.
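The "deterministic Semantic Action Parsing" mentioned above can be illustrated with a minimal sketch. The paper does not specify its actual action space or parsing rules, so the action names and patterns below are purely illustrative assumptions: the point is that a fixed grammar maps an MLLM's text output to a structured action (or rejects it), which is what makes an action-validity metric well-defined.

```python
import re

def parse_semantic_action(output: str):
    """Deterministically map an MLLM's text output to a structured action.

    Returns a dict on success, or None if the output falls outside the
    (hypothetical) semantic action space -- invalid outputs can then be
    counted toward an action-validity metric.
    """
    text = output.strip().lower()
    if (m := re.fullmatch(r"move\((up|down|left|right)\)", text)):
        return {"type": "move", "direction": m.group(1)}
    if (m := re.fullmatch(r"click\((\d+),\s*(\d+)\)", text)):
        return {"type": "click", "x": int(m.group(1)), "y": int(m.group(2))}
    if (m := re.fullmatch(r"press\((\w+)\)", text)):
        return {"type": "press", "key": m.group(1)}
    return None
```

Because the parser is a pure function of the text, the same model output always yields the same action, which supports the benchmark's goal of reproducible comparisons across model-interface pairs.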

Abstract

Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still face challenges from latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best-performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.
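The abstract's "state-verifiable metrics for outcome-based evaluation" can be sketched as predicates over the game's final state, in contrast to heuristic verification (e.g. judging screenshots). The task names and state fields below are hypothetical, not GameWorld's actual task schema; the sketch only shows why state-based verification makes reruns score identically.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """A task paired with a success predicate over the final game state."""
    name: str
    success: Callable[[dict], bool]

# Hypothetical tasks and state fields, for illustration only.
TASKS = [
    Task("reach_score_100", lambda s: s.get("score", 0) >= 100),
    Task("collect_all_keys",
         lambda s: s.get("keys_collected") == s.get("keys_total")),
]

def success_rate(final_states: dict[str, dict]) -> float:
    """Outcome-based evaluation: each task is verified directly from the
    recorded final game state, so the same episode always scores the same."""
    passed = sum(t.success(final_states[t.name]) for t in TASKS)
    return passed / len(TASKS)
```

Because the verdict depends only on recorded state, not on a judge model or visual heuristics, repeated full-benchmark reruns (as reported in the paper) can meaningfully measure the benchmark's own robustness.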