Can Vision Language Models Judge Action Quality? An Empirical Evaluation
arXiv cs.CL / 4/10/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper empirically evaluates state-of-the-art vision-language models for Action Quality Assessment (AQA) across multiple activities (e.g., fitness, figure skating, diving) and across different task setups, representations, and prompting strategies.
- Baseline results show that leading models (Gemini 3.1 Pro, Qwen3-VL, InternVL3.5) perform only marginally above random chance, indicating limited capability for judging fine-grained movement quality.
- Adding techniques like skeleton information, grounding instructions, reasoning structures, and in-context learning produces only sporadic gains, with no consistently effective strategy found.
- The analysis identifies two systematic failure biases: overpredicting correct execution independent of visual evidence and being overly sensitive to superficial wording in prompts.
- Reformulating the task contrastively to counter these biases (see the sketch after this list) yields minimal improvement, leading the authors to conclude that the limitation runs deeper than prompt framing and that robust mitigation is needed before real-world deployment.
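To make the two task framings concrete, here is a minimal sketch of how an AQA query to a VLM might be phrased directly (judge one clip) versus contrastively (compare two clips). The prompt wording, the `Clip` structure, and the example data are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal sketch (hypothetical prompts, not the paper's): two ways of framing
# an Action Quality Assessment (AQA) query to a vision-language model.
# Video frames would be attached to the model call separately.

from dataclasses import dataclass


@dataclass
class Clip:
    """A video clip to be judged."""
    clip_id: str
    activity: str          # e.g. "squat", "triple axel", "3m springboard dive"
    skeleton_summary: str  # optional textual rendering of pose/skeleton cues


def direct_judgment_prompt(clip: Clip, use_skeleton: bool = False) -> str:
    """Direct framing: ask whether a single clip shows correct execution.
    The summary notes models tend to over-predict 'correct' under this framing."""
    parts = [
        f"You are judging the execution quality of a {clip.activity}.",
        "Watch the attached clip and answer: is the movement executed correctly?",
        "Answer 'correct' or 'incorrect', then name the specific error if any.",
    ]
    if use_skeleton:
        parts.insert(1, f"Skeleton cues extracted from the clip: {clip.skeleton_summary}")
    return "\n".join(parts)


def contrastive_prompt(clip_a: Clip, clip_b: Clip) -> str:
    """Contrastive reformulation: compare two clips of the same activity and
    pick the better execution, intended to counter the 'always correct' bias."""
    return "\n".join([
        f"Both attached clips show a {clip_a.activity}.",
        "Compare the two performances and answer: which clip (A or B) shows",
        "better execution quality? Answer 'A' or 'B' and justify briefly.",
    ])


if __name__ == "__main__":
    a = Clip("dive_017", "3m springboard dive", "knees bent during entry")
    b = Clip("dive_042", "3m springboard dive", "straight entry, minimal splash")
    print(direct_judgment_prompt(a, use_skeleton=True))
    print("---")
    print(contrastive_prompt(a, b))
```

Per the key points above, neither framing reliably helps: the contrastive version only partially offsets the over-prediction bias, and small wording changes can still swing the models' judgments.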