GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations
arXiv cs.CV / 3/12/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- GroundCount augments vision-language models with explicit spatial grounding from object detectors to mitigate counting hallucinations.
- The method achieves up to 81.3% counting accuracy with the Ovis2.5-2B model (a 6.6-percentage-point improvement) and, for stronger models, reduces inference time by about 22% by eliminating hallucination-driven reasoning loops.
- Ablation results show that positional encoding benefits stronger models but can hinder weaker ones, and that removing confidence scores generally improves performance across most architectures.
- Compared with feature-level fusion, explicit symbolic grounding via structured prompts yields superior performance across most evaluated VLM architectures, though one model degrades due to incompatibility with iterative reflection mechanisms.
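The symbolic grounding described above can be pictured as serializing detector outputs into a structured text preamble for the VLM prompt. The sketch below is an assumption-laden illustration, not the paper's implementation: the function name, detection format, and prompt template are invented, and the defaults merely echo the ablation findings (positions included, confidence scores omitted).

```python
# Hypothetical sketch: turn object-detector outputs into a structured
# grounding prompt for a VLM. Field names and template are assumptions.

def detections_to_prompt(detections, include_positions=True,
                         include_confidence=False):
    """Serialize detections into a grounding preamble.

    detections: list of dicts with keys "label", "box" = (x1, y1, x2, y2),
    and "score". Confidence is off by default, mirroring the ablation
    finding that removing scores generally helps.
    """
    lines = []
    for i, det in enumerate(detections, start=1):
        parts = [f"object {i}: {det['label']}"]
        if include_positions:
            x1, y1, x2, y2 = det["box"]
            parts.append(f"at [{x1}, {y1}, {x2}, {y2}]")
        if include_confidence:
            parts.append(f"(confidence {det['score']:.2f})")
        lines.append(" ".join(parts))
    header = f"Detected {len(detections)} objects:"
    return "\n".join([header] + lines)

# Example: two detected apples become an explicit count plus locations,
# prepended to the user's counting question.
dets = [
    {"label": "apple", "box": (10, 20, 50, 60), "score": 0.93},
    {"label": "apple", "box": (70, 22, 110, 64), "score": 0.88},
]
prompt = detections_to_prompt(dets) + "\n\nHow many apples are in the image?"
print(prompt)
```

The point of the symbolic route is that the count and positions enter the model as explicit tokens the VLM can condition on, rather than as fused hidden features it must decode implicitly.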