Counting to Four is still a Chore for VLMs

arXiv cs.CV / 4/14/2026


Key Points

  • The paper investigates why vision-language models (VLMs) still struggle with simple object counting despite strong performance on harder multimodal reasoning tasks.
  • It introduces COUNTINGTRICKS, a controlled evaluation suite that varies shape-counting cases, patchification layouts, and adversarial prompting to pinpoint failure modes beyond just checking final answers.
  • Attention analysis and component probing show that count-relevant visual evidence is strongest in the modality projection stage but drops in later language layers, where text priors increasingly dominate.
  • The authors evaluate Modality Attention Share (MAS), a lightweight intervention that enforces a minimum share of visual attention during answer generation, which reduces counting failures.
  • The authors plan to release the code and dataset to enable replication and further mechanistic analysis of VLM counting behavior.

Abstract

Vision-language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from visual perception limits, but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at https://github.com/leduy99/-CVPRW26-Modality-Attention-Share.
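To make the MAS idea concrete: the abstract describes it as encouraging a minimum budget of visual attention during answer generation. The paper's exact formulation is not given here, but one simple way to realize such a budget is to rescale each attention row so that image-token positions receive at least a fixed floor of the probability mass. The function name, the rescaling scheme, and the `floor` parameter below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def apply_attention_floor(attn, visual_mask, floor=0.2):
    """Rescale attention rows so visual tokens hold at least `floor` mass.

    attn:        (queries, keys) array; each row is a distribution summing to 1.
    visual_mask: boolean (keys,) array marking image-token positions.

    NOTE: hypothetical sketch of a minimum-visual-attention budget;
    the paper's actual MAS mechanism may differ.
    """
    attn = attn.copy()
    # Current visual attention share per query row.
    vis = attn[:, visual_mask].sum(axis=1, keepdims=True)
    # Only boost rows that have some visual mass but less than the floor.
    deficient = (vis < floor) & (vis > 0)
    vis_scale = np.where(deficient, floor / vis, 1.0)
    txt_scale = np.where(deficient, (1.0 - floor) / (1.0 - vis), 1.0)
    attn[:, visual_mask] *= vis_scale    # lift visual tokens up to the floor
    attn[:, ~visual_mask] *= txt_scale   # shrink text tokens to keep rows normalized
    return attn

# Toy example: a query row that gives only 5% of its attention to image tokens.
mask = np.array([True, True, False, False])
row = np.array([[0.03, 0.02, 0.50, 0.45]])
out = apply_attention_floor(row, mask, floor=0.2)
```

In this toy case the two image tokens are scaled up to jointly hold 20% of the row's mass while the text tokens absorb the difference, so each row remains a valid distribution; an actual intervention would apply such a rescaling inside the language-model attention layers at generation time.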