Abstract
Zero-shot object counting estimates the number of objects in an image from a textual description, without category-specific training. Detection-based methods such as the pre-trained Grounding DINO offer a promising baseline, but their performance is hindered by two limitations: insufficient fine-grained alignment between visual and textual features, which leads to confusion among semantically similar objects, and poor adaptability to multi-scale object distributions in complex scenes. To address these issues, we propose an enhanced zero-shot object counting framework built upon Grounding DINO that integrates (1) a fine-grained cross-modal enhancement module, which refines visual-text alignment through deep feature interaction and cross-attention and explicitly models object-attribute relationships; and (2) a multi-scale semantic adaptation module, which dynamically fuses multi-level visual features with textual semantics to achieve scale-invariant counting. Extensive experiments on FSC-147, CARPK, and CountBench demonstrate the superiority of the proposed method. On FSC-147, it achieves 11.74 MAE and 58.36 RMSE on the validation set and 10.73 MAE and 103.63 RMSE on the test set, which is 3.3% and 17.3% lower MAE than the baseline method CountGD on the validation and test sets, respectively. In cross-dataset evaluation, it attains lower MAE and RMSE on CARPK and lower RMSE on CountBench than the baseline, demonstrating robust zero-shot generalization.
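For readers unfamiliar with the reported metrics, the following is a minimal sketch of how MAE, RMSE, and a relative MAE reduction are conventionally computed over per-image counts. The function names and the toy counts are illustrative assumptions, not the authors' evaluation code or data. Because RMSE squares each error, a few images with very large miscounts can raise it far above MAE.

```python
import math

def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / len(true_counts)

def rmse(pred_counts, true_counts):
    """Root mean squared error; penalizes large per-image errors more than MAE."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts)) / len(true_counts)
    )

def relative_reduction(ours, baseline):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Toy counts for four hypothetical images (made up for illustration only).
predicted = [10, 22, 5, 101]
actual = [12, 20, 5, 95]

print(mae(predicted, actual))
print(rmse(predicted, actual))
print(relative_reduction(9.0, 10.0))
```

Under this convention, a statement such as "3.3% lower MAE than the baseline" corresponds to `relative_reduction(our_mae, baseline_mae)` evaluated on the same dataset split.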
