Sage Journals: Discover world-class research

Abstract

The widespread research and implementation of visual object detection technology have significantly transformed the autonomous driving industry. Autonomous driving relies heavily on visual sensors to perceive and analyze the environment. However, under extreme weather conditions, such as heavy rain, fog, or low light, these sensors may encounter disruptions, resulting in decreased image quality and reduced detection accuracy, thereby increasing the risk for autonomous driving. To address these challenges, we propose adaptive image enhancement (AIE)-YOLO, a novel object detection method to enhance road object detection accuracy under extreme weather conditions. To tackle the issue of image quality degradation in extreme weather, we designed an improved adaptive image enhancement module. This module dynamically adjusts the pixel features of road images based on different scene conditions, thereby enhancing object visibility and suppressing irrelevant background interference. Additionally, we introduce a spatial feature extraction module to adaptively enhance the model's spatial modeling capability under complex backgrounds. Furthermore, a channel feature extraction module is designed to adaptively enhance the model's representation and generalization abilities. Due to the difficulty in acquiring real-world data for various extreme weather conditions, we constructed a novel benchmark dataset named extreme weather simulation-rare object dataset. This dataset comprises ten types of simulated extreme weather scenarios and is built upon a publicly available rare object detection dataset. Extensive experiments conducted on the extreme weather simulation-rare object dataset demonstrate that AIE-YOLO outperforms existing state-of-the-art methods, achieving excellent detection performance under extreme weather conditions.

Keywords

Autonomous driving object detection extreme environments deep learning adaptive image enhancement YOLO

Introduction

Object detection plays a critical role in autonomous driving, as it facilitates the recognition and localization of diverse objects on the road, such as vehicles, pedestrians, and bicycles. Numerous researchers have made remarkable contributions to the field of autonomous driving. For instance, Dewangan and Sahu¹ wrote a comprehensive review in the field of autonomous driving, focusing on the advancements in intelligent vehicle system design and development across different stages, challenges, and domains. Liang and Huang² provided a comprehensive summary of pertinent techniques for 3D object detection in autonomous driving and discussed practical challenges and future development trends in this research domain. Dewangan and Sahu³ proposed a new lane detection framework based on a two-layer deep learning approach for multiple images under different weather conditions. Xie et al.⁴ developed a model predictive control algorithm that enables adaptable and dependable vehicle control through autonomous steering. Wang et al.⁵ proposed a method for model predictive control based on variable prediction horizons, which enhances vehicle path tracking performance. Nevertheless, conducting object detection in extreme environments presents substantial challenges that demand overcoming various complexities and difficulties. These extreme scenarios severely impact visibility and image quality, including heavy rain, dense fog, intense light, and nighttime conditions. Consequently, they impose significant hurdles and adversely affect the performance of object detection algorithms. In such situations, achieving accurate and reliable object detection becomes particularly arduous due to the degraded image quality and reduced visibility, necessitating robust solutions to surmount these obstacles.

To tackle this formidable challenge, numerous researchers have dedicated their efforts to devising innovative methods and techniques to enhance object detection performance in extreme environments. These approaches encompass various aspects, such as image enhancement, domain adaptation, model optimization, and adaptive image enhancement. Dong et al.⁶ proposed a multiscale denoising network for image enhancement. Similarly, Guo et al.⁷ used deep networks to enhance images at the pixel level by formulating low-light enhancement as an image-specific curve estimation.

Haq et al.⁸ propose an enhanced approach for stereo vision, addressing challenges posed by radiometric variations, particularly in textureless, occluded, and discontinuous regions. However, these methods are limited to preprocessing image enhancement and cannot achieve real-time end-to-end object detection tasks.

Recently, domain adaptation has emerged as a technique to improve object detection performance by leveraging domain transfer. Chen et al.⁹ proposed image-level and instance-level domain adaptation components to reduce domain discrepancies and enhance the performance of the detector. Zhu et al.¹⁰ introduced a novel domain adaptation method for object detection, focusing on aligning regions directly related to object detection in two domains to improve the performance of the detector. He and Zhang¹¹ proposed a method for object detection in unconstrained environments by leveraging domain knowledge learned from an auxiliary source domain with sufficient labels. Saito et al.¹² proposed a method for adapting detectors based on Faster R-CNN, leveraging strong local matching and weak global matching.

However, these domain adaptation networks may overlook potential informative features in the images while aligning the features of two distributions. Thus, a research hotspot enables networks to fully utilize original image information for image restoration. Huang et al.¹³ alleviated image degradation by designing two subnetworks and proposed feature restoration to enhance image visibility. Li et al.¹⁴ introduced a CNN-based image deblurring model capable of producing clear images, which can be seamlessly integrated into any detector to enable end-to-end detection capabilities. In pursuit of adaptive image enhancement, Liu et al.¹⁵ proposed a small-scale CNN tailored for adaptive image enhancement, showcasing promising outcomes in object detection under extreme weather conditions.

Various extreme weather conditions give rise to unique physical characteristics, resulting in distinct properties and challenges in visual sensor imaging. The detection methods mentioned above lack comprehensive high-precision detection capabilities and sufficient stable generalization ability. Additionally, the scarcity of large-scale scene data under various extreme weather conditions poses difficulties for learning-based methods to achieve optimal performance through supervised training.

To address the detection challenges in extreme environments, this article proposes a novel object detection method called Adaptive Image Enhancement (AIE)-YOLO. It can adaptively enhance the quality of degraded images and effectively improve the detection performance under various extreme weather conditions. Specifically, an AIE module is proposed to dynamically adjust the pixel-level characteristics of degraded images, providing enhanced visual images for the subsequent detection network. This process effectively suppresses irrelevant background interference. The AIE module comprises a small convolutional block, a contrast filter, and a sharpening filter. Multiple weight parameters are predicted through the small convolutional block, which jointly adjusts and guides the two filters, resulting in improved image quality. Additionally, we propose a spatial feature extraction (SFE) module to adaptively adjust the weights of each feature position based on scene semantics. By integrating spatial attention mechanisms, the SFE module enhances the spatial semantic modeling capability of AIE-YOLO, thereby improving detection performance in images with complex backgrounds. Furthermore, we introduce a channel feature extraction (CFE) module to adaptively select and adjust the weights between feature channels. The CFE module incorporates a channel attention mechanism to focus on salient channel features and reduce reliance on nonsignificant channels, thus enhancing the overall performance of the model. Figure 1 illustrates the simplified end-to-end pipeline of our proposed AIE-YOLO, where an image taken in a dense fog road scene undergoes adaptive image enhancement through the AIE module before being subjected to object detection by the improved detector.

Figure 1.

The simplified end-to-end pipeline of our proposed AIE-YOLO.

We construct a novel benchmark dataset named extreme weather simulation-rare object dataset (EWS-ROD) based on the publicly available ROD to address the challenge of scarce scene data in extreme weather conditions. EWS-ROD covers 10 types of extreme weather simulations, including torrential rain, fog, drop, darken, blur, brighten, snow, sunflare, darken and torrential rain, and drop and torrential rain. This dataset facilitates the evaluation and validation of the performance of existing object detection algorithms in extreme weather scenarios and further enhances the data diversity for extreme scene object detection tasks. Qualitative and quantitative experiments on the EWS-ROD dataset demonstrate that our proposed AIE-YOLO achieves outstanding performance compared to existing detection algorithms. Our main contributions are as follows:

A novel object detection method, AIE-YOLO, is proposed to address the challenges posed by extreme weather conditions and effectively adjust the quality of degraded images, significantly improving overall detection accuracy.

An effective AIE module is designed to dynamically adjust the pixel-level features of degraded images, enhancing image quality and suppressing irrelevant background interference.

To improve detection performance in complex backgrounds, we introduce an SFE module that adaptively adjusts the weights of each position in the image based on scene semantics. Additionally, a CFE module is proposed to adaptively select and adjust the weight parameters of feature channels, thereby enhancing the representation and generalization capabilities of AIE-YOLO.

Materials and methods

Extreme weather simulated ROD dataset

We construct a novel benchmark dataset called EWS-ROD, covering 10 categories of extreme scenes. EWS-ROD is built on the publicly available ROD dataset and aims to address the data scarcity issue for object detection tasks in extreme weather conditions. The ROD dataset, established and released by NullMax,¹⁶ is a comprehensive dataset in the field of autonomous driving. It comprises 10,000 training images and 4000 validation images, offering a substantial collection for training and testing object detection models. The ROD dataset covers diverse scenarios, encompassing various road levels, such as highways, expressways, urban streets, and rural roads. Furthermore, the ROD dataset includes seven common representative object categories, namely cars, trucks, buses, pedestrians, bicycles, tricycles, and motorcycles. To simplify the object categories and enhance training efficiency, we initially group bicycles, motorcycles, and tricycles together as small vehicles to construct our EWS-ROD dataset. Additionally, we select 50% of the data from the ROD dataset to perform the simulation of 10 different extreme weather conditions. We utilized an image enhancement algorithm called automold that can simulate various climates and environments. This algorithm is developed based on Python's Numpy and OpenCV libraries, enabling the transformation of images to various climatic conditions and seasons. Data partitioning is as follows: 10% of the dataset was transformed into the torrential rain scene, another 10% into the fog scene, 6% into the dark and rainy scene, 6% into the drop and rainy scene, 5% into the drop scene, 5% into the darken scene, 5% into the blur scene, 1% into the brighten scene, 1% into the snow scene, and 1% into the sunflare scene. The images in the dataset are uniformly resized to $640 \times$ 640 pixels. Prior to entering the network, the images undergo random cropping, flipping, rotation, and scaling to enhance the robustness and generalization ability of the model. Some vivid examples from our EWS-ROD dataset are depicted in Figure 2.

Figure 2.

Ten types of extreme weather simulated ROD.

Network overview

We propose a novel object detection architecture based on YOLOv8, as illustrated in Figure 3. The architecture primarily comprises three key modules: the backbone, neck module, and head module. The components and characteristics of the architecture are as follows:

Figure 3.

The overall architecture of the proposed AIE-YOLO.

Backbone: The backbone network is the fundamental structure, utilizing the CSPDarkNet¹⁷ architecture to extract features from the input image and generate feature maps at different scales. It typically consists of multiple convolutional layers aimed at capturing both low-level and high-level features. Within the backbone network, the C2f module is employed to enhance gradient flow, while the SPPF module is utilized for aggregating feature information across different scales. In the AIE-YOLO backbone network, three additional modules have been incorporated. The AIE module is responsible for image enhancement, the SFE module is dedicated to spatial feature information extraction, and the CFE module focuses on extracting semantic feature information.

Neck: The neck module is a crucial component positioned between the backbone and the head in the architecture. It efficiently performs top-down and bottom-up feature fusion by leveraging feature pyramid network¹⁸ and path aggregation network,¹⁹ which aggregate features of different scales and enhance feature representations. In AIE-YOLO's neck module, the CFE module is embedded after some feature fusion steps to further enhance the model's feature extraction capability.

Head: The head module generates three prediction heads responsible for predicting bounding boxes, class probabilities, and other relevant information for object detection. It utilizes a decoupled-head approach, employing two separate convolutional layers for classification and regression tasks.

Adaptive image enhancement module

We introduce an adaptive image enhancement module designed to dynamically adjust pixel features in road images according to varying scene conditions, thereby improving object visibility and reducing interference from irrelevant background elements, as depicted in Figure 4. This module comprises a CNN module, a sharpening filter, and a contrast filter. The CNN module analyzes input image data and forecasts the parameters essential for both image filters, facilitating effective image enhancement. The specifics of this module are outlined as follows.

Figure 4.

The proposed adaptive image enhancement module.

CNN module: The CNN module comprises three convolutional kernels, each sized at $3 \times 3$ , a stride of 2, and padding of 1. Additionally, it integrates three ReLU activation functions, two max-pooling layers sized at $2 \times$ 2, a single global average pooling layer measuring $10 \times$ 10, and a fully connected layer. Through the sequential connection of these elements, the CNN module adeptly processes information from the input image, facilitating the prediction of parameters essential for the two filters.

Contrast filter: The contrast filter, also known as the contrast enhancement filter, is a digital image processing technique used to modify the contrast of an image. Its purpose is to increase or decrease the difference in intensity values between pixels to enhance the visibility and clarity of the image.

In our method, we begin by applying adaptive weighting to the three color channels (red, green, and blue) of an RGB image at each pixel. This adaptive weighting is based on the local characteristics of the image and assigns different weights to the color channels. The purpose of this adaptive weighting scheme is to enhance the contrast in each channel based on their specific contributions to the image's overall appearance. Mathematically, this can be described as follows:

F (u) = C o n c a t (W_{r} r (u), W_{g} g (u), W_{b} b (u)),

(1)

where $r (u)$ , $g (u)$ , and $b (u)$ represent the pixel values of the red, green, and blue color channels, respectively, at a specific pixel location $u = (x, y)$ in the RGB image. The weights $W_{r}, W_{g}$ , and $W_{b}$ represent the importance of each channel, which are predicted by a CNN module. $C o n c a t$ represents the concatenation of features along the channel dimension.

Following that, we proceed with contrast filter enhancement, which can be described by the following equation:

E (u) = u \cdot \frac{(1 - \cos (π \cdot F (u)))}{2 \cdot F (u)},

(2)

I_{contrast} (u, w) = w \cdot E (u) + (1 - w) \cdot u,

(3)

where E represents the function mapping and w is used to adjust the effect of contrast enhancement, which is predicted by a CNN module.

Sharpen filter: The main purpose of the sharpening filter is to enhance the edges and details of an image, making it appear clearer and sharper. It achieves this goal by enhancing the high-frequency components of the image. Considering that in extreme environments, details may often be obscured or blurred, using a sharpening filter can effectively alleviate this issue. The sharpening process can be described by the following specific formula:

I_{sharp} (u, w) = u + w \cdot (u - I_{gaus} (u)),

(4)

where w represents the learnable weights that controls the strength of the sharpening effect, and

I_{gaus} (u) = I_{gaus} (x, y)

represents a Gaussian filter, and its formula is described as follows:

I_{gaus} (x, y) = \frac{1}{\sqrt{2 π σ^{2}}} \cdot \sum_{i = - k}^{k} \sum_{j = - k}^{k} \exp (- \frac{i^{2} + j^{2}}{2 σ^{2}}) \cdot I (x^{'}, y^{'}),

{\begin{matrix} x^{'} = x + i, \\ y^{'} = y + j, \end{matrix}

(5)

where $I (x^{'}, y^{'})$ represents the pixel value of the original image at coordinate $(x + i, y + j)$ , $σ$ represents the standard deviation of the Gaussian filter, which controls the degree of blurring of the filter, and k represents the radius of the Gaussian filter, indicating the range of pixel distances considered from the center pixel outward.

To efficiently utilize the feature information obtained from the original image, contrast filter, and sharpening filter, we concatenate them along the channel dimension. Considering their varying importance, we apply adaptive weighting to these three feature maps before combining them. The formula for this operation is as follows:

I_{o} = C o n c a t (α \cdot u, β \cdot I_{contrast}, γ \cdot I_{sharp}),

(6)

where $α$ , $β$ , and $γ$ represent the trainable weight parameters.

Spatial feature extraction module

To improve the model's ability to extract spatial feature details in extreme environments where feature information tends to be blurry, we propose an SFE module. This module consists of a feature extraction branch and a spatial attention branch. The feature extraction branch is responsible for further extracting informative features from the input images. Meanwhile, the spatial attention branch leverages spatial contextual information to compute attention weights, enhancing the model's ability to focus on important spatial regions. It selectively amplifies relevant spatial features while suppressing irrelevant or noisy features, thus improving the discriminative power of the extracted features. Finally, the outputs of the two feature maps are adaptively weighted and concatenated. SFE module enhances the model's capacity to capture relevant spatial details and attend to regions with rich information. By integrating feature embedding and attention mechanisms, it effectively highlights discriminative spatial features, significantly improving the blurriness of feature information in extreme environments. The detailed structure of the proposed module is illustrated in Figure 5.

Figure 5.

Architecture of spatial feature extraction module and channel feature extraction module.

Channel feature extraction module

We design a CFE module with the objective of extracting important information from different channels of feature maps and enhancing the network's feature extraction capability in extreme environments. The channel information extraction module consists of two main components: a feature extraction branch and a channel attention branch. The feature extraction branch is responsible for further extracting feature information from the input data. By leveraging the channel attention mechanism, the module effectively captures and emphasizes the most discriminative channel features. It selectively amplifies informative channels while suppressing less relevant ones, allowing the model to better understand the unique contributions of different channels and adaptively highlight the most relevant features. Finally, the outputs of the two feature maps are adaptively weighted and concatenated. The CFE module improves the model's ability to capture relevant channel details and focus on channels with rich information. By integrating feature embedding and attention mechanisms, it effectively highlights channel features with discriminative power, enhancing the model's extraction capability of abstract feature information. The specific structure of the module is illustrated in Figure 5.

Result

This study comprehensively evaluated object detection performance in extreme driving environments using the EWS-ROD dataset through comparative experiments and ablative studies. To measure the accuracy and robustness of the models, we utilized various evaluation metrics, including mean average precision (mAP) at Intersection over Union (IoU) threshold 0.5 (mAP@0.5), mAP at IoU threshold 0.75 (mAP@0.75), and mAP across IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95). These comprehensive metrics assess our model's object detection capabilities under challenging driving conditions.

Experimental environment

To ensure the reproducibility and validation of our proposed network model, we conducted experiments using the following settings. All experiments were performed on a computer running Ubuntu 18.04, equipped with an Intel i9-12900KF processor and an NVIDIA^® GeForce RTX 3090 GPU with 24 GB VRAM. The system has 32 GB of memory. The experimental environment was set up with Python 3.9 and the deep learning framework PyTorch 1.13.1, along with Cuda version 11.6. We initialized the learning rate to 0.001 and set the weight decay decay coefficient to 0.0005. The momentum parameter was set to 0.937. The batch size for training was set to 32. These selected experimental settings were intended to ensure reliable and effective evaluation of the model's performance.

Evaluation metrics

Table 1 presents the classification of real and predicted scenarios, where true positives (TP) stands for true positives, representing cases where both the predicted and real scenarios are positive. False positives (FP) refers to false positives, indicating cases where the predicted scenario is positive, but the real scenario is negative. False negatives (FN) represents false negatives, indicating cases where the predicted scenario is negative while the real scenario is positive. Lastly, TN denotes true negatives, representing cases where both the predicted and real scenarios are negative. The evaluation metrics used in this research include average precision (AP) and mAP. The mAP serves as the primary performance measure for object detection algorithms and is widely employed to evaluate the effectiveness of object detection models. A higher mAP value indicates superior detection performance on the given dataset. The precision-recall (P-R) curve is constructed by plotting precision on the vertical axis and recall on the horizontal axis. The area under the P-R curve (AP) corresponds to the AP value, while mAP is the average of the AP values across all categories. Furthermore, IoU quantifies the overlap between predicted and ground truth bounding boxes, providing an important measure of detection accuracy. The performance metrics in this study follow the standard COCO³² style metrics. mAP@0.5 signifies the mAP value computed with an IoU threshold of 0.5, whereas mAP@0.75 represents the mAP value with an IoU threshold of 0.75. Additionally, mAP@0.5:0.95 represents the average mAP value across different IoU thresholds ranging from 0.5 to 0.95, with a step size of 0.05. These metrics provide a comprehensive evaluation of the model's performance under various overlap conditions and contribute to the thorough analysis of the object detection results.

Table 1.

Confusion matrix.

		Predicted
		Positive	Negative
Actual	Positive Negative	True Positive (TP) False Positive (FP)	False Negative (FN) True Negative (TN)

Based on the provided confusion matrix in Table 1, we can define the precision (P) metric as the ratio of TP to the sum of TP and FP, expressed as $\frac{T P}{T P + F P}$ . Similarly, the recall (R) metric is defined as the ratio of TP to the sum of TP and FN, given by $\frac{T P}{T P + F N}$ . To calculate the AP, we integrate the precision values $P (r)$ over the range of recall values from 0 to 1. This is achieved by computing the area under the precision-recall curve. Moreover, the mAP can be determined as the average of the AP values across all categories or classes, which provides a comprehensive assessment of the model's detection performance. The mAP can be represented as follows:

m A P = \frac{\sum_{i = 1}^{n} A P_{i}}{n},

(7)

where n represents the number of classes.

Ablation experiments on EWS-ROD

In this study, we present AIE-YOLO, an advanced object detection method specifically designed for optimal performance in extreme weather conditions. AIE-YOLO incorporates our novel AIE, SFE, and CFE modules, each playing a crucial role in dynamically improving image quality, enhancing spatial semantic modeling, and emphasizing important channel features. To validate the significance of our contributions, we will conduct meticulous ablation experiments, systematically evaluating the impact of each module on the model's performance in challenging weather scenarios. The experimental results are recorded in Table 2.

Table 2.

Ablations analysis on the EWS-ROD dataset.

Baseline	AIE	SFE	CFE	mAP@0.5(%)	mAP@0.75(%)	mAP@0.5:0.95(%)
√	X	X	X	41.5	28.4	26.4
√	√	X	X	42.3	28.8	27.3
√	√	√	X	42.6	29.3	27.7
√	√	√	√	43.4	30.1	28.6

EWS-ROD: extreme weather simulation-rare object dataset; AIE: adaptive image enhancement; SFE: spatial feature extraction; CFE: channel feature extraction; mAP: mean average precision.

Based on the results in Table 2, we can evaluate the significance of each module in the AIE-YOLO object detection method. The AIE module increases mAP@0.5 from 41.5% to 42.3%, improves mAP@0.75 from 28.4% to 28.8%, and raises mAP@0.5:0.95 from 26.4% to 27.3%. These results demonstrate that the AIE module significantly enhances feature information in images, thereby improving object detection performance. Subsequently, the introduction of the SFE module further enhances performance, with mAP@0.5 reaching 42.6%, mAP@0.75 increasing to 29.3%, and mAP@0.5:0.95 rising to 27.7%. This highlights the SFE module's ability to better capture spatial information of objects, contributing to an overall improvement in accuracy. Additionally, the introduction of the CFE module demonstrates its significant impact on model performance. The mAP@0.5 reaches its highest value of 43.4%, mAP@0.75 further increases to 30.1%, and mAP@0.5:0.95 rises to 28.6%, underscoring the importance of CFE in emphasizing relevant channel features and leading to enhanced detection accuracy. Overall, the comprehensive evaluation metrics for the complete AIE-YOLO model clearly reflect the critical contributions of each module. The AIE, SFE, and CFE modules collectively enhance object detection accuracy under extreme weather conditions. These results validate the importance and effectiveness of each module in achieving superior object detection performance.

To validate the effectiveness of each filter in the AIE module, we conducted ablation experiments on the filters. The experimental procedure is as follows: we first removed the sharpen filter from the AIE module of AIE-YOLO, then removed the contrast filter, and finally added both filters back in. The results are shown in Table 3. Analysis of the experimental results indicates a significant decrease in performance after removing the contrast filter and sharpening filter. This thoroughly validates the effectiveness of these two filters for our task.

Table 3.

The ablation experiment of the AIE module conducted on the EWS-ROD dataset.

contrast filter	sharpen filter	mAP@0.5(%)	mAP@0.75(%)	mAP@0.5:0.95(%)
√	X	43.1	29.5	27.6
X	√	42.6	29.3	27.2
√	√	43.4	30.1	28.6

EWS-ROD: extreme weather simulation-rare object dataset; AIE: adaptive image enhancement; mAP: mean average precision.

Comparative experiments on EWS-ROD

To validate the effectiveness of the proposed AIE-YOLO model, we replicated a wide range of object detection algorithms and conducted comprehensive comparative experiments. The primary objective was to compare its performance with various object detection models across a range of evaluation metrics. Through these experiments, we aimed to demonstrate the superiority of our proposed approach in addressing the challenges of object detection under extreme weather conditions. As shown in Table 4, our approach outperforms all other single-stage and two-stage object detection models, achieving the best performance across the metrics of mAP@0.5, mAP@0.75, and mAP@0.5:0.95. Specifically, compared to the best-performing two-stage model, Cascade-RCNN,²² our method demonstrates superiority by 1.7%, 1%, and 1.5% in terms of mAP@0.5, mAP@0.75, and mAP@0.5:0.95, respectively. Moreover, when compared to single-stage models, our approach shows even more significant improvements. Compared to the baseline YOLOv8, it achieves higher mAP@0.5 by 1.9%, mAP@0.75 by 1.7%, and mAP@0.5:0.95 by 2.2%. Similarly, when compared to RT-DETR,³¹ our approach exhibits superior performance with 4.9% higher mAP@0.5, 5.4% higher mAP@0.75, and 5.5% higher mAP@0.5:0.95. These comparative results firmly establish the effectiveness of AIE-YOLO in object detection under extreme environmental conditions.

Table 4.

The quantitative experimental results on the EWS-ROD dataset.

	Method	mAP@0.5(%)	mAP@0.75(%)	mAP@0.5:0.95(%)
Two-stage	Fast R-CNN²⁰	40.1	25.3	23.4
	Faster R-CNN²¹	41.3	26.5	25.2
	Cascade R-CNN²²	41.7	29.1	27.1
One-stage	SSD300²³	35.3	21.7	21.5
	RetinaNet²⁴	37.3	23.1	22.8
	FCOS²⁵	39.6	25.6	23.9
	YOLOR²⁶	40.8	25.3	24.8
	YOLOv5-s	41.4	28.3	26.8
	YOLOX-s²⁷	41.6	28.1	26.4
	YOLOv6-s²⁸	41.9	28.9	27.1
	YOLOv7²⁹	42.1	26.3	25.8
	YOLOv8-s	41.5	28.4	26.4
	RTMDet³⁰	40.7	27.9	26.2
	RT-DETR³¹	38.5	24.7	23.1
	AIE-YOLO (Ours)	43.4	30.1	28.6

EWS-ROD: extreme weather simulation-rare object dataset; AIE: adaptive image enhancement; mAP: mean average precision.

To further demonstrate the effectiveness of our modular approach and model, and to make the analysis more objective, we selected four representative methods and embedded the AIE module into these four methods. We selected these models based on the following considerations: cascade R-CNN and FCOS represent classic examples of two-stage and one-stage object detection models, respectively. The reason for choosing YOLOv6 and YOLOv7 is because they are the latest models in the YOLO series, which are widely used in the field of object detection. By conducting experiments with these models, we can demonstrate more convincing experimental results. As shown in Table 5, after incorporating the AIE module, the mAP@0.5, mAP@0.75, and mAP@0.5:0.95 metrics of these four methods all showed improvement, demonstrating the effectiveness of our AIE module. At the same time, our overall model still represents state-of-the-art results, validating the effectiveness of the methods we proposed.

Table 5.

Verify the effectiveness of AIE by using different baselines on the EWS-ROD dataset.

	Method	mAP@0.5(%)	mAP@0.75(%)	mAP@0.5:0.95(%)
Two-stage	Cascade R-CNN	41.7	29.1	27.1
Two-stage	AIE-Cascade R-CNN	42.2	29.3	27.6
One-stage	FCOS	39.6	25.6	23.9
	AIE-FCOS	40.6	25.9	24.7
	YOLOv6-s	41.9	28.9	27.1
	AIE-YOLOv6-s	42.3	29.1	27.3
	YOLOv7	42.1	26.3	25.8
	AIE-YOLOv7	42.5	27.1	26.2
	AIE-YOLO (Ours)	43.4	30.1	28.6

AIE: adaptive image enhancement; mAP: mean average precision.

We also conducted a comparative analysis of our model with the single-stage YOLO series models v5 s, v6 s, and v8 s, as well as the two-stage Faster R-CNN model, in terms of both speed and parameter count, as depicted in Table 6. After analysis, the following conclusions can be drawn: YOLOv8 s exhibits the fastest speed, while YOLOv5 s stands out as the most lightweight option. Our method slightly lags behind other models in terms of speed and parameter count due to the incorporation of adaptive image enhancement modules and attention mechanisms. However, compared to the Faster R-CNN model, our method still holds significant advantages.

Table 6.

The inference speed and parameter count of our proposed method are compared with other methods.

	Method	Speed (ms)	Parameters (M)
Two-stage	Faster R-CNN	102.5	41.7
One-stage	YOLOv5-s	12.8	7.2
	YOLOv6-s	11.3	18.5
	YOLOv8-s	10.7	11.2
	AIE-YOLO (Ours)	22.8	20.3

AIE: adaptive image enhancement.

Perceptual comparisons

In extreme weather conditions, our proposed AIE-YOLO aims to enhance the accuracy of road object detection by adaptively addressing the challenge of degraded visual image quality. To intuitively demonstrate the effectiveness and superiority of our method, we prepared 10 types of road images captured under extreme environmental conditions, including torrential rain, fog, drop, darken, blur, brighten, snow, sunflare, darken and torrential rain, and drop and torrential rain. For each of these extreme weather images, we utilized YOLOv8 and AIE-YOLO for object detection and compared the results through visualizations. As shown in Figure 6, it is evident that AIE-YOLO consistently outperforms YOLOv8 in various extreme weather conditions,

Figure 6.

The visualization results under 10 extreme weather conditions.

demonstrating its advanced performance. When comparing it with the Ground Truth, it is noticeable that YOLOv8 exhibits many instances of missed detections, false positives, and misclassifications. In contrast, AIE-YOLO shows relatively stable results with better accuracy and recall rates.

Conclusions

In this article, we proposed AIE-YOLO, a novel object detection algorithm aimed at improving the accuracy of road object detection under extreme weather conditions and addressing the problem of degraded visual image quality. The algorithm incorporates an adaptive image enhancement module that dynamically adjusts pixel features based on scene conditions, enhancing object visibility and suppressing irrelevant interference. Additionally, a spatial feature extraction module improves spatial modeling capabilities in cluttered backgrounds, while a channel feature extraction module enhances representation and generalization capabilities. The construction of the EWS-ROD dataset further strengthens research on extreme weather scenarios and contributes to evaluating future algorithms. Extensive experiments on the EWS-ROD dataset demonstrate that AIE-YOLO outperforms existing methods, making it an effective solution for object detection in extreme weather conditions. Our work enhances the safety and reliability of autonomous driving. It takes a significant step toward improving the robustness and applicability of object detection technology in real-world extreme weather conditions. While our proposed method can dynamically adjust the pixel features of road images based on different scenario conditions, these adjustments are still reliant on predefined weather scenario conditions. This implies that our algorithm may not fully encompass the variability of actual complex weather conditions. Enhancing the model's generalization capability to make it applicable to the majority of complex weather situations is thus our next planned endeavor.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China under grants 62072211, 51939003, and U20A20285.

ORCID iD

Yu Jiang

Author biographies

Qianren Guo received his BS degree from the School of Science and Information Science, Qingdao Agricultural University, China, in 2022, and he is currently pursuing master's degree from the College of Software, Jilin University, China. His research fields include deep learning, machine vision, semantic segmentation and object detection.

Yuehang Wang received his BS degree from the College of Computer Science and Technology, Wuhan University of Science and Technology, China, in 2018, and his MS degree from the College of Software, Jilin University, China, in 2021. He is pursuing a PhD at the College of Computer Science and Technology, Jilin University, China. His research fields include deep learning, Internet of Things, machine vision and object detection.

Yongji Zhang received his BS degree in electronic information engineering from the College of Defense Education, Heilongjiang University, China, in 2019, and his MS degree from the College of Computer Science and Technology, Jilin University, China, in 2023. He is currently pursuing a PhD at the College of Computer Science and Technology, Jilin University, China. His main research interests are computer vision and motion, such as applications of machine vision, video frame interpolation, and Internet of Things.

Minghao Zhao received his MS in computer science and technology from Jilin University, China, in 2020. He is pursuing a PhD at the College of Earth Sciences, Jilin University. His research interests include machine learning, remote sensing image processing, and machine vision.

Yu Jiang is an associate professor with the College of Computer Science and Technology, Jilin University, China. He received his MS and PhD degrees from the College of Computer Science and Technology, Jilin University, China, in 2005 and 2011, respectively. His research fields include Internet of Things, intelligent transportation, artificial intelligence, and machine vision.

References

Dewangan

Sahu

. Towards the design of vision-based intelligent vehicle system: methodologies and challenges. Evol Intell 2023; 16: 759–800.

Liang

Huang

. Survey on deep learning-based 3d object detection in autonomous driving. Trans Inst Meas Control 2023; 45: 761–776.

Dewangan

Sahu

. Lane detection in intelligent vehicle system using optimal 2-tier deep convolutional neural network. Multimed Tools Appl 2023; 82: 7293–7317.

Xie

Wang

, et al. Adaptive coordination control strategy for path-following of ddaev with neural network based moving weight coefficients. Trans Inst Meas Control 2023; 45: 1282–1297.

Wang

Chen

, et al. Path tracking based on model predictive control with variable predictive horizon. Trans Inst Meas Control 2021; 43: 2676–2688.

Dong

Pan

Xiang

, et al. Multi-scale boosted dehazing network with dense feature fusion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.2157–2167.

Guo

, et al. Zero-reference deep curve estimation for low-light image enhancement. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.1780–1789.

Haq

Lin

Ruan

, et al. An edge-aware based adaptive multi-feature set extraction for stereo matching of binocular images. J Ambient Intell Humaniz Comput 2022; 13; 1953–1967.

Chen

Sakaridis

, et al. Domain adaptive faster R-CNN for object detection in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp.3339–3348.

10.

Zhu

Pang

Yang

, et al. Adapting object detectors via selective cross-domain alignment. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.687–696.

11.

Zhang

. Multi-adversarial faster-rcnn for unrestricted object detection. In: Proceedings of the IEEE/CVF International conference on computer vision, pp.6668–6677.

12.

Saito

Ushiku

Harada

, et al. Strong-weak distribution alignment for adaptive object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.6956–6965.

13.

Huang

Jaw

. Dsnet: joint semantic learning for object detection in inclement weather conditions. IEEE Trans Pattern Anal Mach Intell 2020; 43: 2623–2633.

14.

Peng

Wang

, et al. Aod-net: all-in-one dehazing network. In: Proceedings of the IEEE international conference on computer vision, pp.4770–4778.

15.

Liu

Ren

, et al. Image-adaptive yolo for object detection in adverse weather conditions. In: Proceedings of the AAAI conference on artificial intelligence, volume 36, pp.1792–1800.

16.

Song

Zhang

, et al. Traffic context aware data augmentation for rare object detection in autonomous driving. In: 2022 international conference on robotics and automation (ICRA), pp.4548–4554: IEEE.

17.

Bochkovskiy

Wang

Liao

HYM

. Yolov4: optimal speed and accuracy of object detection. arXiv preprint arXiv:200410934 2020.

18.

Lin

Dollár

Girshick

, et al. Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp.2117–2125.

19.

LiuS

Qin

, et al. Path aggregation network for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp.8759–8768.

20.

Girshick

. Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision, pp.1440–1448.

21.

Ren

Girshick

, et al. Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 2015; 28.

22.

Cai

Vasconcelos

. Cascade R-CNN: delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp.6154–6162.

23.

Liu

Anguelov

Erhan

, et al. SSD: single shot multibox detector. In: Computer vision–ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pp.21–37: Springer.

24.

Lin

Goyal

Girshick

, et al. Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp.2980–2988.

25.

Tian

Shen

Chen

, et al. Fcos: fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp.9627–9636.

26.

Wang

Yeh

Liao

HYM

. You only learn one representation: unified network for multiple tasks. arXiv preprint arXiv:210504206 2021.

27.

Liu

Wang

, et al. Yolox: exceeding Yolo series in 2021. arXiv preprint arXiv:210708430 2021.

28.

Jiang

, et al. Yolov6: a single-stage object detection framework for industrial applications. arXiv preprint arXiv:220902976 2022.

29.

Wang

Bochkovskiy

Liao

HYM

. Yolov7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.7464–7475.

30.

Lyu

Zhang

Huang

, et al. Rtmdet: an empirical study of designing real-time object detectors. arXiv preprint arXiv:221207784 2022.

31.

Zhao

, et al. Detrs beat yolos on real-time object detection. arXiv preprint arXiv:230408069 2023.

32.

Lin

Maire

Belongie

, et al. Microsoft coco: common objects in context. In: Computer vision–ECCV 2014: 13th European conference, Zurich, Switzerland, September 6–12, 2014, proceedings, Part V 13, pp.740–755: Springer.

AIE-YOLO: Effective object detection method in extreme driving scenarios via adaptive image enhancement

Abstract

Keywords

Introduction

Materials and methods

Extreme weather simulated ROD dataset

Network overview

Adaptive image enhancement module

Spatial feature extraction module

Channel feature extraction module

Result

Experimental environment

Evaluation metrics

Ablation experiments on EWS-ROD

Comparative experiments on EWS-ROD

Perceptual comparisons

Conclusions

Footnotes

Declaration of conflicting interests

Funding

ORCID iD

Author biographies

References