Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting

arXiv cs.CV / 3/18/2026

Key Points

Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars.
The paper presents QICA, a framework that combines quantity perception with robust spatial feature aggregation to enhance fine-grained counting and spatial awareness.
It introduces a Synergistic Prompting Strategy (SPS) to adapt vision and language encoders using numerically conditioned prompts, linking semantic recognition with numerical reasoning.
A Cost Aggregation Decoder (CAD) operates on vision-text similarity maps and refines them through spatial aggregation to mitigate feature distortion and preserve zero-shot transferability.
A multi-level quantity alignment loss (L_MQA) enforces numerical consistency across the pipeline. Experiments on FSC-147 show competitive performance, and zero-shot evaluation on CARPK and ShanghaiTech-A indicates strong generalization to unseen domains.
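
To make the idea of a vision-text similarity map more concrete, here is a minimal sketch in plain Python. It is not the paper's Cost Aggregation Decoder: the function names (`similarity_map`, `box_aggregate`), the toy feature values, and the fixed box filter standing in for learned spatial aggregation are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def similarity_map(patch_feats, text_feat):
    """Score every patch in an H x W grid of feature vectors against one text embedding."""
    return [[cosine(f, text_feat) for f in row] for row in patch_feats]

def box_aggregate(sim, radius=1):
    """Average each cell with its in-bounds neighbours.

    Hypothetical stand-in: a learned decoder would replace this fixed filter.
    """
    h, w = len(sim), len(sim[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [sim[y][x]
                    for y in range(max(0, i - radius), min(h, i + radius + 1))
                    for x in range(max(0, j - radius), min(w, j + radius + 1))]
            out[i][j] = sum(vals) / len(vals)
    return out

# Toy 2 x 2 grid of 2-D patch features and a text embedding (made-up values).
patches = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 1.0], [-1.0, 0.0]]]
text = [1.0, 0.0]

sim = similarity_map(patches, text)   # raw per-patch similarity scores
agg = box_aggregate(sim)              # spatially smoothed scores
```

Because the aggregation step only sees similarity scores and not the raw encoder features, a module of this kind can be trained without directly reshaping the feature space, which is the intuition behind operating on similarity maps.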

Abstract

Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task and therefore lack fine-grained quantity awareness. Furthermore, they frequently exhibit spatial insensitivity and degraded generalization due to feature space distortion during model adaptation. To address these challenges, we present QICA, a novel framework that synergizes quantity perception with robust spatial cost aggregation. Specifically, we introduce a Synergistic Prompting Strategy (SPS) that adapts vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a Cost Aggregation Decoder (CAD) that operates directly on vision-text similarity maps. By refining these maps through spatial aggregation, CAD prevents overfitting while preserving zero-shot transferability. Additionally, a multi-level quantity alignment loss ($\mathcal{L}_{MQA}$) is employed to enforce numerical consistency across the entire pipeline. Extensive experiments on FSC-147 demonstrate competitive performance, while zero-shot evaluation on CARPK and ShanghaiTech-A validates superior generalization to unseen domains.
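
The abstract does not state which metrics are reported. Counting benchmarks such as FSC-147 are commonly evaluated with mean absolute error (MAE) and root mean squared error (RMSE) over per-image counts, so the sketch below assumes those two. The function names and the sample counts are made up for illustration.

```python
import math

def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def rmse(pred, gt):
    """Root mean squared error; penalizes large per-image misses more than MAE."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

# Hypothetical predicted and ground-truth counts for three images.
pred_counts = [10.0, 52.0, 7.0]
gt_counts = [12, 50, 7]

print(f"MAE:  {mae(pred_counts, gt_counts):.3f}")
print(f"RMSE: {rmse(pred_counts, gt_counts):.3f}")
```

Reporting both is useful because RMSE grows faster than MAE when a few images are badly miscounted, which tends to happen in dense scenes.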