Abstract
Environmental perception is a critical component of automated driving systems, and advancing perception algorithms toward open-world road scenarios is a current research trend. However, traditional single-modality detectors generalize poorly, while modern vision-language models require manually supplied text prompts to function properly. Because conventional methods fail to meet the requirements of open-world environmental perception, this study proposes a prompt-free vision-language model to address the problem. The proposed model first introduces a prompt memory pretraining strategy, which builds a text prompt memory by pretraining on large-scale object detection datasets. A dynamic prompt generation module is then proposed to identify foreground categories in the vision-modality input and query the stored prompt memory to generate the corresponding text prompts. Extensive experiments demonstrate that the proposed method significantly outperforms both conventional detectors and modern vision-language models while relying solely on vision-modality input. The code and trained models are available at https://github.com/unbelieboomboom/prompt_free_G_DINO.
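The abstract's two components can be pictured as follows. This is a minimal, hypothetical sketch assuming a dictionary-backed prompt memory; the class and method names (`PromptMemory`, `DynamicPromptGenerator`, `store`, `query`, `generate`) are illustrative, not the authors' actual API.

```python
# Hypothetical sketch of the two components described in the abstract.
# Assumption: the prompt memory maps category names to stored text prompts,
# and detected foreground categories are joined in the " . "-separated style
# that Grounding-DINO-like models accept as a text prompt.

class PromptMemory:
    """Text prompt memory built during pretraining on detection datasets."""

    def __init__(self):
        self._memory = {}

    def store(self, category, prompt):
        # During pretraining, each seen category is associated with a prompt.
        self._memory[category] = prompt

    def query(self, categories):
        # Retrieve stored prompts for the categories found in the image.
        return [self._memory[c] for c in categories if c in self._memory]


class DynamicPromptGenerator:
    """Turns detected foreground categories into a text prompt,
    removing the need for a manually supplied prompt."""

    def __init__(self, memory):
        self.memory = memory

    def generate(self, detected_categories):
        prompts = self.memory.query(detected_categories)
        return " . ".join(prompts)


if __name__ == "__main__":
    memory = PromptMemory()
    for cat in ["car", "pedestrian", "traffic light"]:
        memory.store(cat, cat)

    gen = DynamicPromptGenerator(memory)
    print(gen.generate(["car", "pedestrian"]))  # -> "car . pedestrian"
```

In the actual model, the foreground categories would come from the vision branch and the memory would hold learned prompt representations rather than raw strings; the sketch only shows the data flow.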
