Counting to Four is still a Chore for VLMs
arXiv cs.CV / 4/14/2026
Key Points
- The paper investigates why vision-language models (VLMs) still struggle with simple object counting despite strong performance on harder multimodal reasoning tasks.
- It introduces COUNTINGTRICKS, a controlled evaluation suite that varies shape-counting cases, patchification layouts, and adversarial prompting to pinpoint failure modes beyond just checking final answers.
- Attention analysis and component probing show that count-relevant visual evidence is strongest at the modality projection stage but fades in later language layers, where text priors increasingly dominate (a probing sketch follows this list).
- The authors evaluate Modality Attention Share (MAS), a lightweight intervention that enforces a minimum allocation of visual attention during answer generation, reducing counting failures (a sketch of the idea follows this list).
- The authors plan to release the code and dataset to enable replication and further mechanistic analysis of VLM counting behavior.
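
The kind of layer-wise probe described in the third point can be sketched as follows. This is not the paper's code; the `attentions` list and `image_token_mask` are assumptions about how you would extract attention maps and image-token positions from your own VLM (e.g. a forward pass with `output_attentions=True` in a Hugging Face model).

```python
import torch

def visual_attention_share(attentions, image_token_mask, query_pos=-1):
    """Per-layer share of attention that a query position gives to image tokens.

    attentions: list of tensors, one per layer, each [batch, heads, seq, seq].
    image_token_mask: bool tensor [batch, seq], True where the token is visual.
    query_pos: which query position to probe (default: the last token).
    """
    shares = []
    for layer_attn in attentions:
        # Attention from the probed query position to every key position,
        # averaged over heads: shape [batch, seq].
        attn_from_query = layer_attn[:, :, query_pos, :].mean(dim=1)
        visual_mass = (attn_from_query * image_token_mask.float()).sum(dim=-1)
        total_mass = attn_from_query.sum(dim=-1)
        shares.append((visual_mass / total_mass).mean().item())
    # One value per layer; a drop in later layers is the pattern the paper
    # associates with text priors taking over.
    return shares
```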
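And here is a minimal sketch of the *idea* behind a minimum visual-attention floor, not the paper's MAS implementation: if image tokens receive less than a floor share `tau` of the post-softmax attention mass, rescale so they receive exactly `tau` and the remaining mass stays on the other tokens. The function name and `tau` value are illustrative assumptions.

```python
import torch

def enforce_min_visual_share(attn_weights, image_token_mask, tau=0.2, eps=1e-8):
    """attn_weights: [batch, heads, q_len, k_len] post-softmax attention.
    image_token_mask: bool tensor [batch, k_len], True at image-token positions.
    """
    mask = image_token_mask[:, None, None, :].float()  # broadcast to attention shape
    visual_mass = (attn_weights * mask).sum(dim=-1, keepdim=True)
    text_mass = (attn_weights * (1.0 - mask)).sum(dim=-1, keepdim=True)

    # Only intervene where the visual share is below the floor.
    needs_boost = (visual_mass < tau).float()
    visual_scale = tau / (visual_mass + eps)
    text_scale = (1.0 - tau) / (text_mass + eps)

    # Rescale so visual tokens get exactly tau and the rest sums to 1 - tau.
    boosted = attn_weights * (mask * visual_scale + (1.0 - mask) * text_scale)
    return needs_boost * boosted + (1.0 - needs_boost) * attn_weights
```

In practice such an intervention would be applied inside the attention modules during answer generation (e.g. via forward hooks); how and where MAS itself is applied is detailed in the paper.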