Abstract
Environmental perception is a critical component of automated driving systems, and advancing perception algorithms toward open-world road scenarios is a current research trend. However, traditional single-modality detectors generalize poorly, while modern vision-language models require manually supplied text prompts to function properly. Because conventional methods fail to meet the requirements of open-world environmental perception, this study proposes a prompt-free vision-language model to address the problem. The proposed model first introduces a prompt memory pretraining strategy, which builds a text prompt memory by pretraining on large-scale object detection datasets. A dynamic prompt generation module is then proposed to identify foreground categories in the vision-modality input and query the stored prompt memory to generate the corresponding text prompts. Extensive experiments demonstrate that the proposed method significantly outperforms both conventional detectors and modern vision-language models while relying solely on vision-modality input. The code and trained models are available at https://github.com/unbelieboomboom/prompt_free_G_DINO.
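The abstract's two components can be pictured as follows. This is a minimal, hypothetical sketch assuming a dictionary-backed prompt memory; the class and method names (`PromptMemory`, `DynamicPromptGenerator`, `store`, `query`, `generate`) are illustrative, not the authors' actual API.

```python
# Hypothetical sketch of the two components described in the abstract.
# Assumption: the prompt memory maps category names to stored text prompts,
# and detected foreground categories are joined in the " . "-separated style
# that Grounding-DINO-like models accept as a text prompt.

class PromptMemory:
    """Text prompt memory built during pretraining on detection datasets."""

    def __init__(self):
        self._memory = {}

    def store(self, category, prompt):
        # During pretraining, each seen category is associated with a prompt.
        self._memory[category] = prompt

    def query(self, categories):
        # Retrieve stored prompts for the categories found in the image.
        return [self._memory[c] for c in categories if c in self._memory]


class DynamicPromptGenerator:
    """Turns detected foreground categories into a text prompt,
    removing the need for a manually supplied prompt."""

    def __init__(self, memory):
        self.memory = memory

    def generate(self, detected_categories):
        prompts = self.memory.query(detected_categories)
        return " . ".join(prompts)


if __name__ == "__main__":
    memory = PromptMemory()
    for cat in ["car", "pedestrian", "traffic light"]:
        memory.store(cat, cat)

    gen = DynamicPromptGenerator(memory)
    print(gen.generate(["car", "pedestrian"]))  # -> "car . pedestrian"
```

In the actual model, the foreground categories would come from the vision branch and the memory would hold learned prompt representations rather than raw strings; the sketch only shows the data flow.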
