PushupBench: Your VLM is not good at counting pushups

arXiv cs.CV / 4/28/2026

📰 News · Signals & Early Trends · Models & Research

Key Points

  • The paper argues that existing vision-language models (VLMs) may understand video content but struggle with tasks requiring exact repetition counting, such as counting pushups.
  • It introduces PushupBench, a new evaluation dataset of 446 long-form video clips (average 36.7 seconds) designed specifically for repetition counting.
  • The strongest frontier model reaches 42.1% exact accuracy, while open-source 4B models achieve roughly 6%, indicating a large gap in counting capability.
  • The authors show that accuracy alone can be misleading because weaker models often fall back on the “modal” (most frequent) count instead of performing temporal reasoning (see the sketch after this list).
  • They report that fine-tuning on pushup counting with only 1k samples improves performance on broader video understanding benchmarks, suggesting counting can serve as a proxy for temporal reasoning.
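
To make the “modal count” caveat concrete, here is a minimal sketch of an exact-accuracy check alongside a baseline that always answers with the dataset's most frequent count. All field names, counts, and numbers below are illustrative assumptions, not the paper's protocol or the released benchmark.

```python
from collections import Counter

# Illustrative ground-truth repetition counts and model predictions for a handful
# of clips. These values are made up for the example, not taken from PushupBench.
ground_truth = [12, 15, 10, 12, 20, 12, 8, 15, 12, 10]
model_preds  = [12, 14, 10, 12, 18, 12, 9, 15, 12, 11]

def exact_accuracy(preds, labels):
    """Fraction of clips whose predicted count matches the label exactly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Trivial "modal count" baseline: always guess the most frequent count in the
# label distribution, without looking at the video at all.
modal_count = Counter(ground_truth).most_common(1)[0][0]
modal_preds = [modal_count] * len(ground_truth)

print(f"model exact accuracy: {exact_accuracy(model_preds, ground_truth):.2f}")
print(f"modal-count baseline: {exact_accuracy(modal_preds, ground_truth):.2f}")
```

If the label distribution is peaked around a few common counts, this no-look baseline scores non-trivially, which is why the authors treat exact accuracy on its own as weak evidence of temporal reasoning.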

Abstract

Large vision-language models (VLMs) can recognize what happens in video but fail to count how many times. We introduce PushupBench, 446 long-form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1% exact accuracy; open-source 4B models score ~6%, matching supervised baselines. We show that accuracy alone misleads: weaker models exploit the modal count rather than reason temporally. Fine-tuning on counting with 1k samples transfers to general video understanding: MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54), suggesting counting is a proxy for broader temporal reasoning. PushupBench is incorporated in lmms-eval (https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/1262) and hosted at pushupbench.com/.