Do Vision Language Models Understand Human Engagement in Games?

arXiv cs.CV / 3/20/2026

Key Points

  • The paper evaluates three vision–language models on the GameVibe Few‑Shot dataset across nine first‑person shooter games to assess whether visual cues alone can infer human engagement.
  • Zero‑shot predictions from VLMs are generally weak and often do not outperform simple per‑game majority‑class baselines; retrieval‑augmented prompting can improve pointwise engagement predictions in some settings (a sketch of such a prompt appears after this list).
  • Pairwise prediction of engagement change between consecutive windows remains consistently difficult across all prompting strategies, while theory‑guided prompting does not reliably help and can instead reinforce surface‑level shortcuts.
  • The findings suggest a perception–understanding gap in current VLMs: they can recognize visible gameplay cues but struggle to robustly infer human engagement across games.
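
Below is a minimal sketch of what retrieval‑augmented prompting for pointwise engagement prediction could look like. The data fields, label set, example windows, and prompt wording are illustrative assumptions made for this summary; the paper's actual pipeline operates on gameplay video rather than text descriptions, and its prompt format is not reproduced here.

```python
# Hypothetical sketch of retrieval-augmented prompting for pointwise
# engagement prediction. Field names, labels, and prompt wording are
# illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class LabeledWindow:
    game: str          # e.g. one of the nine FPS games in GameVibe Few-Shot
    description: str   # textual summary of the gameplay window (stand-in for video)
    engagement: str    # assumed label set: "low", "medium", "high"


def build_rag_prompt(query_description: str,
                     retrieved: list[LabeledWindow]) -> str:
    """Assemble a prompt that pairs retrieved labeled windows with the query clip."""
    lines = ["You rate how engaging a gameplay clip is (low / medium / high)."]
    for i, ex in enumerate(retrieved, 1):
        lines.append(f"Example {i} ({ex.game}): {ex.description}")
        lines.append(f"Engagement: {ex.engagement}")
    lines.append(f"New clip: {query_description}")
    lines.append("Engagement:")
    return "\n".join(lines)


if __name__ == "__main__":
    examples = [
        LabeledWindow("Game A", "Intense firefight, low health, fast camera motion", "high"),
        LabeledWindow("Game B", "Player walks through an empty corridor", "low"),
    ]
    print(build_rag_prompt("Player reloads while flanking two enemies", examples))
```

The point of the setup is simply that labeled windows retrieved from similar gameplay give the model grounded reference points, which is consistent with the reported gains in some pointwise settings.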

Abstract

Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision–language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction remains consistently difficult across strategies. Theory-guided prompting alone does not reliably help and can instead reinforce surface-level shortcuts. These findings suggest a perception–understanding gap in current VLMs: although they can recognize visible gameplay cues, they still struggle to robustly infer human engagement across games.
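
To make the two evaluation settings concrete, the sketch below contrasts a per-game majority-class baseline and pointwise label accuracy with pairwise prediction of engagement change between consecutive windows. The data structures, label names, and accuracy metrics are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of the per-game majority-class baseline and of
# pointwise vs. pairwise evaluation; labels and metrics are assumed.
from collections import Counter


def majority_class_baseline(train_labels_by_game: dict[str, list[str]]) -> dict[str, str]:
    """Per-game majority-class baseline: always predict a game's most common label."""
    return {game: Counter(labels).most_common(1)[0][0]
            for game, labels in train_labels_by_game.items()}


def pointwise_accuracy(preds: list[str], golds: list[str]) -> float:
    """Pointwise: predict the engagement label of each window independently."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def pairwise_change_accuracy(pred_scores: list[float], gold_scores: list[float]) -> float:
    """Pairwise: predict whether engagement rises, falls, or stays flat between
    consecutive windows, then compare against the gold direction of change."""
    def directions(scores: list[float]) -> list[int]:
        # +1 if the next window is higher, -1 if lower, 0 if unchanged
        return [(b > a) - (b < a) for a, b in zip(scores, scores[1:])]
    pred_dir, gold_dir = directions(pred_scores), directions(gold_scores)
    return sum(p == g for p, g in zip(pred_dir, gold_dir)) / len(gold_dir)


if __name__ == "__main__":
    print(majority_class_baseline({"Game A": ["high", "high", "low"]}))   # {'Game A': 'high'}
    print(pointwise_accuracy(["high", "high"], ["high", "low"]))          # 0.5
    print(pairwise_change_accuracy([0.2, 0.6, 0.5], [0.1, 0.7, 0.7]))     # 0.5
```

The pairwise setting is stricter because the model must track relative changes over time rather than matching each game's dominant label, which is consistent with the paper's finding that it remains difficult across prompting strategies.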