Can Vision Language Models Judge Action Quality? An Empirical Evaluation

arXiv cs.CL / 4/10/2026


Key Points

  • The paper empirically evaluates state-of-the-art vision-language models for Action Quality Assessment (AQA) across multiple activities (e.g., fitness, figure skating, diving) and across different task setups, representations, and prompting strategies.
  • Baseline results show major models (Gemini 3.1 Pro, Qwen3-VL, InternVL3.5) only achieve marginal performance above random chance, indicating limited capability for judging fine-grained movement quality.
  • Adding techniques like skeleton information, grounding instructions, reasoning structures, and in-context learning produces only sporadic gains, with no consistently effective strategy found.
  • The analysis identifies two systematic failure biases: a tendency to predict correct execution regardless of visual evidence, and oversensitivity to superficial wording in prompts.
  • Contrastive task reformulation to address these biases yields minimal improvement, leading the authors to conclude the core limitation is deeper than prompt framing and that robust mitigation is needed before real-world deployment.
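The paper's actual prompts are not reproduced here, but the contrastive reformulation described above can be roughly illustrated. In this hypothetical sketch (all wording and function names are assumptions for illustration), the absolute framing asks for a correct/incorrect verdict on a single clip, which invites the "predict correct" default, while the contrastive framing asks the model to compare two clips, removing that default answer:

```python
# Hypothetical illustration of two AQA prompt framings.
# The paper's exact prompts are not given here; the wording below
# is an assumption for illustration only.

def binary_prompt(action: str) -> str:
    """Absolute framing: judge one clip as correct or incorrect.

    This framing is susceptible to the bias described above, where a
    model overpredicts 'correct' regardless of visual evidence.
    """
    return (
        f"Watch the clip of a {action}. "
        "Is the execution correct or incorrect? Answer with one word."
    )

def contrastive_prompt(action: str) -> str:
    """Contrastive framing: compare two clips instead of judging one.

    Forcing a choice between A and B removes the single default
    'correct' answer, though the paper reports this reformulation
    yields only minimal improvement.
    """
    return (
        f"You will see two clips of a {action}. "
        "Which clip shows the better execution, A or B? "
        "Answer with one letter."
    )
```

Even under this reformulation, the authors find performance barely improves, which is why they argue the limitation lies in fine-grained movement perception rather than prompt framing.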

Abstract

Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases and pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline of failure modes requiring mitigation prior to reliable real-world deployment.