PushupBench: Your VLM is not good at counting pushups

arXiv cs.CV / 4/28/2026

Key Points

The paper argues that existing vision-language models (VLMs) may understand video content but struggle with tasks requiring exact repetition counting, such as counting pushups.
It introduces PushupBench, a new evaluation dataset of 446 long-form video clips (average 36.7 seconds) designed specifically for repetition counting.
The strongest frontier model reaches 42.1% exact accuracy, while open-source 4B models score about 6%, on par with supervised baselines, indicating a large gap in counting capability.
The authors show that accuracy alone can be misleading because weaker models often guess the modal (most frequent) count instead of performing temporal reasoning.
They report that fine-tuning on pushup counting using only 1k samples improves broader video understanding benchmarks, suggesting counting can serve as a proxy for temporal reasoning.
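To make the point about misleading accuracy concrete, here is a minimal sketch in Python. It is not the paper's evaluation code: the function names and the toy counts are hypothetical and chosen only for illustration. It compares a model's exact accuracy with the accuracy of always guessing the most frequent ground-truth count, and measures how concentrated the model's own predictions are on a single value.

```python
from collections import Counter


def exact_accuracy(preds, labels):
    """Fraction of clips whose predicted count equals the true count exactly."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


def modal_baseline_accuracy(labels):
    """Accuracy obtained by always answering the most frequent true count."""
    _, freq = Counter(labels).most_common(1)[0]
    return freq / len(labels)


def modal_prediction_share(preds):
    """Share of predictions equal to the model's own most frequent output."""
    _, freq = Counter(preds).most_common(1)[0]
    return freq / len(preds)


# Toy data invented for illustration; not taken from PushupBench.
labels = [10, 10, 10, 12, 15, 8, 10, 20]
preds = [10, 10, 10, 10, 10, 10, 10, 10]  # a model that always answers 10

print(exact_accuracy(preds, labels))    # 0.5
print(modal_baseline_accuracy(labels))  # 0.5
print(modal_prediction_share(preds))    # 1.0
```

In this toy case the model's exact accuracy equals the modal baseline and every prediction is the same value, so the score reflects the label distribution rather than any counting of repetitions. Reporting the baseline and the prediction share alongside exact accuracy is one way to expose this.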

Abstract

Large vision-language models (VLMs) can recognize what happens in video but fail to count how many times. We introduce PushupBench, 446 long-form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1% exact accuracy; open-source 4B models score ~6%, matching supervised baselines. We show that accuracy alone misleads: weaker models exploit the modal count rather than reason temporally. Fine-tuning on counting with 1k samples transfers to general video understanding: MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54), suggesting counting is a proxy for broader temporal reasoning. PushupBench is incorporated in lmms-eval (https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/1262) and hosted at pushupbench.com/.