Abstract
Autonomous driving is a complex system that includes perception, cognition and control functions. Environmental perception, represented by target detection, is an essential part of autonomous driving technology. To improve target detection performance, a model is constructed based on YOLOv3 and improved with three methods to detect vehicles, cyclists and pedestrians on the road. First, a double population genetic algorithm is used to enhance K-means clustering; the resulting algorithm, named K-IGA, optimizes the values of the anchor boxes. Second, according to the structure of the convolution module, the Batch Normalization (BN) layer is merged into the convolution layer to obtain a new output function. Third, the calculation of the loss function is improved using the Generalized Intersection over Union (GIOU). Training and test sets are built from the KITTI data set. The YOLOv3 variants with different improvement methods are named MD2, MD3 and MD4, and they are trained and tested on this data set alongside the classical YOLOv5 algorithm. The results show that MD4, which combines all three improvement methods, has the highest accuracy: its mAP is 7.8% higher than that of the benchmark model MD1 and 0.55% higher than that of YOLOv5. The test results show that the detection model can effectively identify vehicles, pedestrians and cyclists in road scenes, and that its real-time performance meets the requirements of target detection during vehicle driving.
Introduction
Autonomous driving technology can reduce the participation of drivers in the process of vehicle driving and improve driving safety.1 Improving the detection accuracy of vehicles, cyclists and pedestrians on the road is one of the key technologies for realizing autonomous driving.2,3
At present, the most effective target detection algorithms are deep convolutional neural networks,4,5 which integrate the three basic steps of target feature extraction, target classification and target localization into a single convolutional neural network structure. They fall into two categories: two-stage target detection networks and single-stage target detection networks.
The two-stage target detection network first obtains the target candidate boxes, then extracts the target features from the candidate boxes and generates the detection results. At present, two-stage target detection networks have developed from R-CNN to Faster R-CNN.6 Zeng et al.7 integrated the anti-occlusion network into the standard Faster R-CNN detection algorithm, which improved the accuracy and robustness of underwater target detection. Based on Faster R-CNN, Li et al.8 constructed a five-layer structure fusion multi-target detection and recognition algorithm, which improved the accuracy and speed of target recognition in complex traffic environments.
Two-stage algorithms achieve high detection precision, but their real-time performance still struggles to meet the requirements of autonomous driving. The most representative single-stage algorithm is YOLO, which omits the candidate region extraction process to improve the speed of target detection.9 Through continuous improvement by researchers, the YOLO series of algorithms has achieved excellent accuracy and real-time performance.10,11 Panigrahi and Raju12 proposed an improved YOLOv3 network based on the SqueezeNet architecture, and its pedestrian detection performance verified the effectiveness of the proposed algorithm. Yi et al.13 improved the network structure of tiny-yolov3; experimental results showed that this method achieved high pedestrian detection accuracy while satisfying real-time requirements. Ahmed et al.14 combined Gaussian YOLOv3 with channel attention and a feature interleaving module to learn the weight of each channel, which enhanced the network's ability to discriminate between people and background.
The clustering algorithm is a crucial enabler for convolutional neural networks. Martí Caro et al.15 optimized the critical power and bandwidth of YOLOv3 by independent weight clustering for each neural network to shave bandwidth requirements down to 30%–40% and reduce energy consumption to 45%. Wang et al.16 used an improved k-medians clustering method instead of the previous k-means to address model instability in the YOLOv3 method, and achieved good detection results on the KITTI and UA-DETRAC public data sets. Dong et al.17 introduced C3Ghost and Ghost modules into the neck network to reduce floating-point operations, and introduced the Convolution Block Attention Module into the backbone network to suppress unimportant information, which improved the operation speed and detection accuracy of YOLOv5.
Based on the above literature, optimizing the network structure and introducing clustering methods into the neural network are the main measures for improving the YOLO model. However, a clustering algorithm can easily fall into a local optimum when searching for the cluster centers. Evolutionary algorithms such as the genetic algorithm18 and particle swarm optimization can effectively improve the performance of clustering.19
In this paper, the YOLOv3 algorithm for autonomous driving is improved with three methods. Self-made target detection data sets are used to compare the performance of the three improved YOLOv3 algorithms and the classical YOLOv5 algorithm.
The main contributions of this study are summarized as follows:
To obtain the best clustering performance, a double population genetic algorithm with Gaussian mutation and Cauchy mutation is used to select the cluster center points, and the improved clustering algorithm is used to determine the values of the anchor boxes.
Using GIOU to improve the calculation of the loss function helps the model obtain more information about small targets such as pedestrians and cyclists during training. Extracting more feature information is conducive to improving the detection accuracy of the model.
According to the structure of the convolution module, the equations of the convolution layer and the BN layer are combined to improve the forward inference speed.
The various improved YOLOv3 models and the classical YOLOv5 model are trained and tested on the data set. The YOLOv3 algorithm that includes all three improvement methods has the highest accuracy and can effectively identify small targets in road scenes.
The remainder of this paper is organized as follows. In Section 2, three improvement methods for YOLOv3 are proposed in turn: using the K-IGA algorithm to optimize the values of the anchor boxes, using the GIOU idea to improve the calculation of the loss function, and merging the BN layer into the convolution layer of the neural network. In Section 3, we first prepare the data set and configure the computing environment, then use the performance evaluation indexes to analyze the detection accuracy of the various improved YOLOv3 models and the classical YOLOv5 model. The improved YOLOv3 model obtains better detection performance. In particular, it achieves higher detection accuracy for small targets such as pedestrians. Finally, Section 4 provides the conclusions and prospects of the improved YOLOv3 model for target detection in autonomous driving.
Construct an improved YOLOv3 target detection model
Overview of convolutional neural networks based on YOLOv3
The convolutional neural network has a variety of structural layers, such as the input layer, convolutional layer, pooling layer, fully connected layer and output layer.20 Because the image feature dimension after the convolution operation is too large to process directly, the downsampling layer performs a pooling operation: it scans the input feature map step by step with a pooling window to condense the relevant information and abstract the useful features.
After several generations of improvement of the YOLO model, YOLOv3 increases the number of layers of the neural network and removes the pooling layer and the fully connected layer, which accelerates detection. It is therefore widely used in object detection. Figure 1 shows the network structure of YOLOv3.

Figure 1. YOLOv3 network structure.
Before training the convolutional neural network, anchor boxes should be determined from the training set. The concept of the anchor box comes from the Faster R-CNN algorithm, where it serves as a prior for the set of candidate target boxes. In the YOLOv3 algorithm, the anchor boxes are several groups of width and height values obtained by clustering the labeled rectangular boxes of the training set; they constrain the range of the detected objects, and the model is trained according to this constraint.
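The clustering of labeled boxes into anchor widths and heights can be illustrated with a minimal sketch. It shows plain k-means, not the K-IGA variant developed in this paper, and it assumes the 1 - IoU distance commonly used for YOLO anchors; the function names `wh_iou` and `kmeans_anchors` are hypothetical.

```python
import numpy as np

def wh_iou(boxes, centers):
    """IoU between (w, h) pairs, treating all boxes as sharing one corner."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0])
             * np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centers[:, 0] * centers[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Plain k-means on box sizes; a box joins the center with the highest IoU."""
    rng = np.random.default_rng(seed)
    # Random initial centers: this is the step whose sensitivity motivates K-IGA.
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(boxes, centers), axis=1)
        new_centers = np.array([
            boxes[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Return anchors ordered from smallest to largest area.
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]
```

With nine clusters, as in YOLOv3, `kmeans_anchors(boxes, k=9)` would return nine (width, height) pairs; the result depends on the random initial centers, which is the weakness discussed in the next section.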
The backbone on the left side of the convolutional neural network processes images to obtain feature maps on multiple scales. This information enters the feature fusion network on the right side, and the detection results are output through feature extraction and fusion on different scales. YOLOv3 uses a deeper network structure and adds residual modules to construct the more advanced Darknet-53 backbone.21 This network structure helps information propagate through deeper layers, and achieves higher processing speed while maintaining recognition accuracy.
Improvement of YOLOv3 target detection algorithm
Clustering anchor by K-IGA algorithm
YOLOv3 embeds a clustering algorithm to obtain the required anchor boxes.18,21 However, the performance of the clustering algorithm depends on the selection of the initial centers, so a good clustering result is not easy to obtain, which reduces the accuracy of the anchor box selection. To optimize the performance of the clustering algorithm, this paper adopts the improved genetic algorithm (IGA).22 According to the fitness value of each individual, IGA divides the population into two sub-populations, which apply the Gaussian mutation operator and the Cauchy mutation operator respectively, to obtain both good local search ability and good global search ability. IGA is then used to find the best center points for the k-means algorithm; the combined method is named the K-IGA algorithm and is applied to anchor box selection. The flow chart of the algorithm is shown in Figure 2, and the procedure is described as follows:
(1) An initial population of N individuals is generated randomly, and each individual specifies the coordinate values of the k cluster centers.
(2) Let the data set Z contain h groups of data, which can be expressed as

Figure 2. Flow chart of the algorithm.
(3) The population is processed by selecting and crossing according to the selection parameter
(4) Perform group mutation operations. The equation (2) calculates the proportional transformation function value of the new individual’s fitness. The individual performs Cauchy mutation when
Where
(5) The N individuals with the smallest fitness in the group after the previous step are extracted to complete the optimization mutation. The individual in the group is combined with the optimal individual
Where
(6) Compare the fitness value of the old and new optimal individual, if
(7) Query the current number of iterations, if
It can be seen that the fitness value corresponding to the optimal
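The grouped mutation and elitist selection in steps (4) and (5) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the fitness is assumed to be the summed distance of each sample to its nearest center, the split rule (median fitness) is an assumption because the paper's condition is not reproduced here, and the names `clustering_fitness`, `dual_population_mutation` and `scale` are hypothetical.

```python
import numpy as np

def clustering_fitness(centers, data):
    """Sum of distances from each sample to its nearest center (smaller is better)."""
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return dists.min(axis=1).sum()

def dual_population_mutation(population, data, scale=0.05, seed=0):
    """Mutate candidate center sets with two operators, then keep the N fittest.

    population has shape (N, k, d); data has shape (h, d).
    Assumption: individuals at or below the median fitness get Gaussian noise
    (local search); the rest get heavy-tailed Cauchy noise (global search).
    """
    rng = np.random.default_rng(seed)
    fitness = np.array([clustering_fitness(ind, data) for ind in population])
    good = fitness <= np.median(fitness)
    gauss = rng.normal(0.0, scale, size=population.shape)
    cauchy = scale * rng.standard_cauchy(size=population.shape)
    mutated = population + np.where(good[:, None, None], gauss, cauchy)
    # Merge parents and offspring and keep the N individuals with smallest fitness.
    merged = np.concatenate([population, mutated])
    merged_fit = np.array([clustering_fitness(ind, data) for ind in merged])
    keep = np.argsort(merged_fit)[:len(population)]
    return merged[keep], merged_fit[keep]
```

Because parents are retained in the merged pool, the best fitness cannot get worse from one generation to the next in this sketch; the best surviving individual would then seed the k-means centers.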
Loss function modification
In some data sets, small target objects such as cyclists and pedestrians occupy only a small part of the image. The training model obtains little useful information from the corresponding pixels, and conventional training makes it difficult to achieve good recognition performance on them. Modifying the loss function may be a better way to improve the detection accuracy.
In YOLOv3, the concept of loss function is used to characterize the deviation between the predicted and actual values of the network model, which can be represented by equation (6):
Where
Where
Where A is the box predicted by the model, B is the actual target box, and
This calculation method has apparent shortcomings in two aspects:
(1) When
(2) The value of
Because of the deficiencies, this network uses the
Where C is the minimum circumscribed rectangle of A and B coverage area,
Compared with the traditional evaluation index
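As an illustration of the quantities discussed above, the following sketch computes IoU and GIOU for two axis-aligned boxes using the standard GIOU definition, with C taken as the smallest rectangle enclosing A and B. The corner format (x1, y1, x2, y2), the function names, and the loss form 1 - GIOU are assumptions made for this example rather than details taken from the paper's equations.

```python
def giou(box_a, box_b):
    """Return (IoU, GIoU) for boxes given as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # C: smallest rectangle that encloses both boxes.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou, iou - (c_area - union) / c_area

def giou_loss(box_a, box_b):
    """Assumed loss form: 1 - GIoU, which lies in [0, 2)."""
    return 1.0 - giou(box_a, box_b)[1]
```

For two disjoint boxes the IoU is 0 regardless of how far apart they are, whereas GIOU becomes more negative as the gap grows, so the loss still distinguishes near misses from distant ones.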
Modifying network structure
In the training process of convolutional neural networks, if the samples of different batches are unevenly distributed, the model consumes more time to adapt to the uneven sample distribution. The convolution module in YOLOv3 contains a BN layer, which is used to normalize the input data. This operation can solve the problem of uneven data distribution between layers, but it will increase the amount of computation during forward propagation and extend the target detection time. Combining the BN layer into the convolutional layer can improve the detection speed.
The convolution layer in the convolution module is as follows (12):
Where
For the BN layer in the convolution module, the operation process is as follows (13):
Where
According to the structure of the convolution module, the output of the convolution layer will be input into the BN layer, and the two modes will be merged into:
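A minimal sketch of the merge, assuming the usual BN form with scale gamma, shift beta, running mean, running variance and a small constant eps. The function name `fold_bn_into_conv` and the array layout are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def fold_bn_into_conv(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BN parameters into convolution weights and bias.

    weight has shape (out_c, in_c, kh, kw); all other arrays have shape (out_c,).
    BN(conv(x)) = gamma * (W*x + b - mean) / sqrt(var + eps) + beta
                = (scale * W) * x + (scale * (b - mean) + beta)
    """
    scale = gamma / np.sqrt(var + eps)
    folded_weight = weight * scale[:, None, None, None]
    folded_bias = (bias - mean) * scale + beta
    return folded_weight, folded_bias
```

The folding is done once after training, so inference runs a single convolution per module instead of a convolution followed by a normalization. If the convolution has no bias term, `bias` can be passed as zeros.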
Target detection model training and application
Data set preparation and computing environment setting
To verify the accuracy of the target detection algorithm, the authoritative data set KITTI is selected for batch processing,24 and 4904 images with a size of 608 × 608 are obtained. Of these, 4100 images are randomly selected as the training set and 804 images as the test set. The distribution of sample types is kept as balanced as possible. Among the labeled targets, 15,016 vehicles are classified as “Car,” 7129 pedestrians as “Pedestrian,” and 2886 cyclists as “Cyclist.”
The training and testing of the YOLO models are based on a self-built data processing platform. The platform uses the Compute Unified Device Architecture (CUDA) parallel computing architecture to support Graphics Processing Unit (GPU) acceleration and improve the efficiency of the algorithm.
During training, the batch_size is 64, the subdivision is 32, and the number of training iterations is 8000. The learning rate uses burn_in mode for the first 1000 iterations, is 10e–3 from 1000 to 6400 iterations, is reduced to 10e–4 from 6400 to 7200 iterations, and is reduced to 10e–5 from 7200 iterations to the end of training. The overlap threshold is set to 0.5.
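The schedule described above can be written as a small function. This is a sketch under two assumptions that the text does not confirm: the base rate is read as 10^-3 (with tenfold reductions at 6400 and 7200 iterations), and the burn-in ramp follows a power law with exponent 4, as is common in Darknet-style training. The function and parameter names are hypothetical.

```python
def learning_rate(iteration, base_lr=1e-3, burn_in=1000, power=4.0,
                  steps=(6400, 7200), scale=0.1):
    """Step learning-rate schedule with a burn-in ramp.

    Assumptions: base_lr = 1e-3 and a power-law ramp with exponent 4.
    """
    if iteration < burn_in:
        # Ramp smoothly from 0 up to base_lr during burn-in.
        return base_lr * (iteration / burn_in) ** power
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= scale  # tenfold reduction at each step boundary
    return lr
```

Under these assumptions the rate stays at the base value between iterations 1000 and 6400, then drops by a factor of ten at 6400 and again at 7200.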
Performance evaluation index of model
To comprehensively measure the performance of the model, the following indicators are used to evaluate the effectiveness of the model:
Where
The recall rate expression is:
Where
Average Precision (AP) is the evaluation index of the overall accuracy of the algorithm on a certain class.
Where
Frames Per Second (FPS) is the detection speed of the model for the test set and can be expressed as:
Where
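The indicators in this section can be sketched with their standard textbook definitions. Here TP, FP and FN denote true positives, false positives and false negatives; AP is computed as the area under the precision-recall curve with all-point interpolation, which is an assumption since the paper's exact AP procedure is not reproduced here; all function names are hypothetical.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Area under the precision-recall curve; recalls must be ascending.

    Assumption: all-point interpolation with a monotone precision envelope.
    """
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]  # make precision non-increasing
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

def frames_per_second(num_images, total_seconds):
    """FPS: number of processed images divided by the total detection time."""
    return num_images / total_seconds
```

A detection is normally counted as a true positive when its overlap with a labeled box exceeds the overlap threshold, which is 0.5 in this experiment.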
Results and analysis of the target detection model
Based on the different YOLOv3 improvement methods, and with a unified sample set and test set, target detection performance is analyzed according to the performance evaluation indexes. Specifically, model MD1 uses YOLOv3's own cluster analysis to determine the anchor boxes. Model MD2 uses the anchor boxes generated by K-IGA clustering. Model MD3 merges the BN layer in the convolution module into the convolution layer on the basis of MD2, and MD4 improves the loss function with GIOU on the basis of MD3.
Clustering anchor boxes by K-IGA algorithm
By default, the YOLOv3 model uses nine groups of anchors for learning on three feature scales. The values of these nine groups of anchors are derived from clustering the actual boxes of the COCO data set. To improve the selection of anchor values, this test uses the K-IGA algorithm to cluster the actual boxes of the self-made training set. The widths and heights in all annotation files of the training set are normalized to form a table of two-dimensional width and height information. The K-IGA algorithm is then used for clustering, and the cluster centers are the anchor values for the data set. Figure 3 shows the clustering result.

Figure 3. Anchor clustering results after normalization.
The array obtained by K-IGA clustering contains the normalized widths and heights of the annotation boxes, which need to be denormalized. The anchor values obtained are shown in Table 1.
Table 1. Anchor values of different models.
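Denormalization can be sketched as scaling the normalized widths and heights by the network input size (608 in this experiment) and grouping the sorted anchors by detection scale. The function name and the sample values below are hypothetical and are not the values in Table 1.

```python
import numpy as np

def denormalize_anchors(norm_anchors, input_size=608, per_scale=3):
    """Convert normalized (w, h) pairs to pixels and group them by scale.

    Anchors are sorted by area, so the first group holds the smallest anchors.
    """
    anchors = np.rint(np.asarray(norm_anchors) * input_size).astype(int)
    anchors = anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
    return [anchors[i:i + per_scale].tolist()
            for i in range(0, len(anchors), per_scale)]

# Hypothetical normalized cluster centers, one per scale for brevity.
groups = denormalize_anchors([[0.5, 0.5], [0.25, 0.125], [0.125, 0.125]],
                             per_scale=1)
```

With nine cluster centers and `per_scale=3`, the three groups correspond to the three feature scales, with the smallest anchors used on the finest feature map.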
Different methods improve the performance of YOLOv3 model
The training results of YOLOv3 with the different improvement methods and of the classical YOLOv5 are shown in Figure 4. The accuracy of the models gradually increases and the loss function value decreases as the number of iterations increases. The loss functions of models MD1 and MD2 stabilize at around 2000 iterations, that of MD3 at around 4000 iterations, and that of MD4 at around 6000 iterations. The precision-recall curves show that, for all four YOLOv3 models and the classical YOLOv5 model, the accuracy for cars is the highest, while the accuracy for cyclists and pedestrians is relatively low. The precision-recall curve of MD2 is better than that of MD1, indicating that the anchor boxes obtained by the K-IGA algorithm are better than those determined by the clustering algorithm in YOLOv3. Owing to the improved network structure of MD3, its speed increases, but its precision-recall curve is slightly worse than that of MD2. Compared with MD4, the classical YOLOv5 detects cars and cyclists slightly more accurately, but its accuracy for pedestrians is not as good as that of the MD4 model.

Figure 4. Performance curves of the training models. (a) Area average IOU curve of MD1. (b) Average loss curve of MD1. (c) Precision-recall curve of MD1. (d) Area average IOU curve of MD2. (e) Average loss curve of MD2. (f) Precision-recall curve of MD2. (g) Area average IOU curve of MD3. (h) Average loss curve of MD3. (i) Precision-recall curve of MD3. (j) Area average IOU curve of MD4. (k) Average loss curve of MD4. (l) Precision-recall curve of MD4. (m) Precision-recall curve of the YOLOv5 model. (n) Box_loss in training. (o) Obj_loss in training. (p) Cls_loss in training.
Table 2 shows the performance evaluation indexes of the models. The mAP of MD1 is 0.8086; its AP is 0.8765 for cars, 0.7865 for cyclists and 0.7629 for pedestrians. The performance of model MD2 is greatly improved: its mAP increases to 0.8544, the AP of cyclists increases to 0.8723, and the AP of pedestrians increases to 0.7945. However, the FPS of MD2 is 16.7% lower than that of MD1. The mAP of model MD3 is close to that of MD2, and because MD3 merges the BN layer into the convolution layer in the convolution module, its FPS is improved. Compared with MD3, MD4 shows a noticeable improvement in mAP, especially for pedestrians, while the FPS remains the same. Compared with MD2, MD4 significantly improves the AP of cars and pedestrians, but its AP of cyclists is slightly lower. Compared with MD1, MD4 improves significantly on every index: the mAP increases by 7.8%, the AP of cars by 4.8%, the AP of cyclists by 9.3%, the AP of pedestrians by 10.6%, and the FPS by 4.1%. The mAP and AP of the classical YOLOv5 model are better than those of MD1, MD2 and MD3. Its mAP is 0.8665, slightly lower than that of MD4, indicating slightly lower overall detection accuracy. Its AP of cars and cyclists is slightly better than that of MD4, but its AP of pedestrians is much lower, indicating relatively poor detection performance for small targets. Overall, MD4 shows excellent performance.
Table 2. AP, mAP, and FPS of different models.
Therefore, using the appropriate anchor box, improving the model structure and modifying the loss function can improve the average accuracy of the model.
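The reported figures can be cross-checked with simple arithmetic. The sketch below uses only values stated in the text: the three per-class AP values of MD1 and the mAP of MD2. Whether the percentage gains quoted in the text are relative changes or percentage-point differences is not stated, so both helper functions are given; their names are hypothetical.

```python
# Per-class AP of the benchmark model MD1, as reported in the text.
ap_md1 = {"Car": 0.8765, "Cyclist": 0.7865, "Pedestrian": 0.7629}

# mAP is the mean of the per-class AP values.
map_md1 = sum(ap_md1.values()) / len(ap_md1)

def relative_gain(new, old):
    """Relative change in percent."""
    return (new - old) / old * 100.0

def absolute_gain(new, old):
    """Difference in percentage points for metrics on a 0-1 scale."""
    return (new - old) * 100.0

map_md2 = 0.8544  # mAP of MD2, as reported in the text
print(round(map_md1, 4))  # 0.8086, matching the reported mAP of MD1
print(relative_gain(map_md2, map_md1), absolute_gain(map_md2, map_md1))
```

The same two helpers can be applied to any pair of entries in Table 2 to see which convention reproduces a quoted percentage.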
Target detection effect verification
To verify the recognition effectiveness of the target detection model, typical scenes from the test set are processed with the MD4 model. Figure 5 shows the detection results. The model can detect three types of targets (vehicles, pedestrians and cyclists) in complex traffic scenes and output target classification information. The detection model can accurately extract target objects in information-rich environments, and it recognizes distant, occluded and partially visible targets well.

Figure 5. Test effect of actual scene.
Conclusions and prospects
YOLOv3 is improved with three methods and obtains better recognition performance. The data processing platform is constructed based on YOLOv3, and the anchor box parameters are determined by the clustering analysis of YOLOv3, thus forming MD1 as the benchmark model. The double population genetic algorithm improves the clustering analysis, and the K-IGA algorithm is obtained. The value of the anchor box is optimized by the K-IGA algorithm, and the MD2 model is constructed. On the basis of MD2, the BN layer is merged into the convolutional layer to obtain the MD3 model. Based on MD3, the GIOU idea is used to improve the calculation method of the loss function, and the MD4 model is obtained.
A total of 4904 images containing vehicles, pedestrians and cyclists are prepared as the data set for model training and testing. The results show that the mAP and AP of MD2 are better than those of MD1, but the FPS is reduced by 16.7%. MD3 shows almost no decrease in mAP while improving the FPS. The overall performance of MD4 is better than that of MD2 and MD3. Compared with MD1, the mAP of MD4 is improved by 7.8%, the AP of cars by 4.5%, the AP of cyclists by 8.5%, the AP of pedestrians by 9.6%, and the FPS by 4.1%. Considering both detection accuracy and speed, MD4 obtains the best target detection performance. MD4 is also slightly better than the classical YOLOv5 as a whole.
Although the improved YOLOv3 model has shown promising results, some areas remain worthy of further research and development.
The perception of the actual road environment should not be limited to vehicles, cyclists and pedestrians. Traffic lights, signal boards and other types of obstacles can be added as detection targets, with particular attention to occluded obstacles, to improve the universality of the target detection model. In optimizing the structure of the convolutional neural network, the effect of the interaction between different structures and anchor box parameters on detection accuracy can also be studied further.
Footnotes
Ethical Considerations
This work did not involve humans or animals. Ethical approval was not required for this research.
Consent to participate
Not applicable.
Consent for publication
The corresponding author gave consent for the publication of the identifiable details.
Author contributions
Concept and design: Wei Yan and Jiashu Ji; data collection and analysis: Jiashu Ji and Lingzhi Xu; drafting of the article: Jiashu Ji, Wei Yan and Fangzhe Sun; critical revision of the article for important intellectual content: Wei Yan and Jiashu Ji. All the authors approved the final article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Shandong Province Key Research and Development Program of China (Grant Nos. 2023CXGC010210 and 2020CXGC011004).
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
Data will be made available on request.
