Can Vision Language Models Judge Action Quality? An Empirical Evaluation

arXiv cs.CL / 4/10/2026


Key Points

  • The paper empirically evaluates state-of-the-art vision-language models for Action Quality Assessment (AQA) across multiple activities (e.g., fitness, figure skating, diving) and across different task setups, representations, and prompting strategies.
  • Baseline results show major models (Gemini 3.1 Pro, Qwen3-VL, InternVL3.5) only achieve marginal performance above random chance, indicating limited capability for judging fine-grained movement quality.
  • Adding techniques like skeleton information, grounding instructions, reasoning structures, and in-context learning produces only sporadic gains, with no consistently effective strategy found.
  • The analysis identifies two systematic failure biases: a tendency to predict correct execution regardless of visual evidence, and oversensitivity to superficial wording in prompts.
  • Contrastive task reformulation to address these biases yields minimal improvement, leading the authors to conclude the core limitation is deeper than prompt framing and that robust mitigation is needed before real-world deployment.
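The paper's actual prompts are not reproduced here, but the contrastive reformulation described above can be roughly illustrated. In this hypothetical sketch (all wording and function names are assumptions for illustration), the absolute framing asks for a correct/incorrect verdict on a single clip, which invites the "predict correct" default, while the contrastive framing asks the model to compare two clips, removing that default answer:

```python
# Hypothetical illustration of two AQA prompt framings.
# The paper's exact prompts are not given here; the wording below
# is an assumption for illustration only.

def binary_prompt(action: str) -> str:
    """Absolute framing: judge one clip as correct or incorrect.

    This framing is susceptible to the bias described above, where a
    model overpredicts 'correct' regardless of visual evidence.
    """
    return (
        f"Watch the clip of a {action}. "
        "Is the execution correct or incorrect? Answer with one word."
    )

def contrastive_prompt(action: str) -> str:
    """Contrastive framing: compare two clips instead of judging one.

    Forcing a choice between A and B removes the single default
    'correct' answer, though the paper reports this reformulation
    yields only minimal improvement.
    """
    return (
        f"You will see two clips of a {action}. "
        "Which clip shows the better execution, A or B? "
        "Answer with one letter."
    )
```

Even under this reformulation, the authors find performance barely improves, which is why they argue the limitation lies in fine-grained movement perception rather than prompt framing.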

Abstract

Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases and pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline of failure modes requiring mitigation prior to reliable real-world deployment.