Zero-shot Object Counting with Fine-Grained Cross-Modal Enhancement and Multi-Scale Semantic Adaptation

Abstract

Zero-shot object counting estimates the number of objects in an image from a textual description, without category-specific training. Detection-based methods such as the pre-trained Grounding DINO offer a promising baseline, but their performance is hindered by two critical limitations: insufficient fine-grained alignment between visual and textual features, which leads to confusion among semantically similar objects, and poor adaptability to multi-scale object distributions in complex scenes. To address these issues, we propose an enhanced framework built upon Grounding DINO that integrates: (1) a fine-grained cross-modal enhancement module, which refines visual-text alignment via deep feature interaction and cross-attention mechanisms, explicitly modeling object-attribute relationships; and (2) a multi-scale semantic adaptation module, which dynamically fuses multi-level visual features with textual semantics to achieve scale-invariant counting. Extensive experiments on FSC-147, CARPK, and CountBench demonstrate the superiority of the proposed method. On FSC-147, our method achieves 11.74 MAE and 58.36 RMSE on the validation set and 10.73 MAE and 103.63 RMSE on the test set, which is 3.3% and 17.3% lower MAE than the CountGD baseline on the validation and test sets, respectively. For cross-dataset generalization, it attains lower MAE and RMSE on CARPK and lower RMSE on CountBench than the baseline, demonstrating robust zero-shot capability.
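For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how MAE and RMSE are conventionally computed over per-image counts. The function names and the sample counts are illustrative assumptions and are not taken from the paper or its evaluation code.

```python
import math


def mae(pred, true):
    """Mean absolute error between predicted and ground-truth counts."""
    if len(pred) != len(true) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def rmse(pred, true):
    """Root mean squared error between predicted and ground-truth counts."""
    if len(pred) != len(true) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))


# Hypothetical per-image counts (not results from the paper).
predicted = [12, 48, 7, 300]
ground_truth = [10, 50, 7, 400]

print("MAE :", mae(predicted, ground_truth))
print("RMSE:", rmse(predicted, ground_truth))
```

Because RMSE squares each error before averaging, a single badly miscounted dense image (the last pair in the sample) raises RMSE far more than MAE. This may be one reason a test-set RMSE can sit well above the corresponding MAE.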

Keywords

zero-shot object counting; multi-modal foundational model; cross-modality; text prompt

