The study on target detection technology based on improved YOLOv3 model

Abstract

Autonomous driving is a complex system that includes perception, cognition and control functions. Environmental perception, represented by target detection, is an essential part of autonomous driving technology. To improve target detection performance, a model is constructed based on YOLOv3 and improved with three methods to detect vehicles, cyclists and pedestrians on the road. A double population genetic algorithm is used to enhance K-means clustering analysis; the resulting algorithm, named K-IGA, optimizes the anchor box values. According to the structure of the convolution module, the Batch Normalization (BN) layer is merged into the convolution layer to obtain a new output function. The calculation of the loss function is improved using the Generalized Intersection over Union (GIOU). The training set and test set are then built from the KITTI data set. The YOLOv3 variants with different improvement methods are named MD2, MD3 and MD4. They are trained and tested on the data set together with the classical YOLOv5 algorithm. The results show that the MD4 algorithm, which combines all three improvement methods, has the highest accuracy: its mAP is 7.8% higher than that of the benchmark model MD1 and 0.55% higher than that of YOLOv5. The test results show that the detection model can effectively identify vehicles, pedestrians and cyclists in the road scene, and its real-time performance meets the requirements of target detection during vehicle driving.

Keywords

dual population genetic algorithm, clustering analysis, convolutional neural network, target detection, autonomous driving

Introduction

Autonomous driving technology can reduce the participation of drivers in the process of vehicle driving and improve driving safety.¹ Improving the detection accuracy of vehicles, cyclists and pedestrians in the road is one of the key technologies to realize autonomous driving.^2,3

At present, the most effective target detection algorithms are deep convolutional neural networks,^4,5 which integrate the three basic steps of target feature extraction, target classification and target location into a single network structure. They fall into two categories: two-stage target detection networks and single-stage target detection networks.

The two-stage target detection network first obtains the target candidate box, then extracts the target features from the target candidate box and generates the detection results. At present, the two-stage target detection network has developed from RCNN to Faster RCNN.⁶ Zeng et al.⁷ integrated the anti-occlusion network into the standard Faster R-CNN detection algorithm, which obtained better accuracy and robustness of underwater target detection. Based on Faster R-CNN, Li et al.⁸ constructed a five-layer structure fusion multi-target detection and recognition algorithm, which improved the accuracy and speed of target recognition in complex traffic environments.

The two-stage algorithms have high detection precision, but their real-time performance still struggles to meet the requirements of autonomous driving. The most representative single-stage algorithm is YOLO, which omits the candidate region extraction process to improve the speed of target detection.⁹ Through continuous improvement by researchers, the YOLO series of algorithms has achieved excellent accuracy and real-time performance.^10,11 Panigrahi and Raju¹² proposed an improved YOLOv3 network based on the SqueezeNet architecture, and its pedestrian detection performance verified the effectiveness of the proposed algorithm. Yi et al.¹³ improved the network structure of tiny-yolov3. Experimental results showed that this method achieved high pedestrian detection accuracy while satisfying real-time requirements. Ahmed et al.¹⁴ combined Gaussian YOLOv3 with channel attention and a feature interleaving module to learn the weight of each channel, which enhanced the network’s ability to discriminate between people and background.

The clustering algorithm is a crucial enabler for convolutional neural networks. Martí Caro et al.¹⁵ optimized the critical power and bandwidth of YOLOv3 by independent weight clustering for each neural network to shave bandwidth requirements down to 30%–40% and reduce energy consumption to 45%. Wang et al.¹⁶ replaced k-means with an improved k-medians clustering method to reduce the model instability of YOLOv3 and achieved good detection results on the KITTI and UA-DETRAC public datasets. Dong et al.¹⁷ introduced C3Ghost and Ghost modules into the neck network to reduce floating-point operations, and introduced the Convolution Block Attention Module into the backbone network to suppress unimportant information, which improved the operation speed and detection accuracy of YOLOv5.

The above literature shows that optimizing the network structure and introducing clustering methods into the neural network are the main measures for improving the YOLO model. However, a clustering algorithm can easily fall into a local optimum when searching for the cluster centers. Evolutionary algorithms such as the genetic algorithm¹⁸ and particle swarm optimization can effectively improve clustering performance.¹⁹

In this paper, the YOLOv3 algorithm for autonomous driving is improved with three methods. A self-made target detection data set is used to compare the performance of the improved YOLOv3 models with that of the classical YOLOv5 algorithm.

The main contributions of this study are summarized as follows:

Aiming at the best clustering performance, the double population genetic algorithm of Gauss mutation and Cauchy mutation is used to select the clustering center point, and the improved clustering algorithm is used to determine the value of the anchor box.

Using GIOU in the calculation of the loss function helps the model obtain more information about small targets such as pedestrians and cyclists during training. Extracting more feature information is conducive to improving the detection accuracy of the model.

According to the structure of the convolution module, the equations of the convolution layer and the BN layer are combined to improve the forward inference speed.

The various improved models of YOLOv3 and the classical YOLOv5 model are trained and tested on the data set. The YOLOv3 algorithm including the three improved methods has the highest accuracy and can effectively identify small targets in road scenes.

The remainder of this paper is organized as follows. In Section 2, three improved methods of YOLOv3 are proposed in turn, which are using K-IGA algorithm to optimize the value of anchor box, using GIOU idea to improve the calculation method of loss function, and merging BN layer to convolution layer in neural network. In Section 3, we first prepare the data set and configure the computing environment, then use the performance evaluation index to analyze the detection accuracy of various improved models of YOLOv3 and the classical YOLOv5 model. The improved YOLOv3 model obtains better detection performance. In particular, it achieves higher detection accuracy in small targets characterized by pedestrians. Finally, Section 4 provides the conclusion and prospect of the improved YOLOv3 model for target detection in autonomous driving.

Construct an improved YOLOv3 target detection model

Overview of convolutional neural networks based on YOLOv3

A convolutional neural network has a variety of structural layers, such as the input layer, convolutional layer, pooling layer, fully connected layer and output layer.²⁰ Because the feature dimension of the image after the convolution operation is too large to process directly, the downsampling layer performs a pooling operation: a pooling window scans the input feature map step by step to condense the information and abstract the useful features.
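As a minimal illustration of the pooling operation, the sketch below applies max pooling to a small feature map. The choice of max pooling, the 2 × 2 window, the stride of 2 and the function name `max_pool2d` are assumptions made for the example; the text does not specify a pooling type.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Slide a size x size window over a 2-D list and keep each window's maximum."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 5, 7],
    [1, 1, 8, 3],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 2], [2, 8]]
```

The 4 × 4 input is reduced to 2 × 2, which shows how pooling shrinks the feature dimension while keeping the strongest response in each region.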

After several generations of improvement of the YOLO model, YOLOv3 increases the number of layers of the neural network and removes the pooling layers and the fully connected layers, which accelerates detection. It is therefore widely used in object detection. Figure 1 shows the network structure of YOLOv3.

Figure 1.

YOLOv3 network structure.

Before training the convolutional neural network, the training set should be processed with anchor box. The concept of anchor box comes from the Faster R-CNN algorithm, which is the prior value of the candidate target box set. In the YOLOv3 algorithm, anchor box constrains the range of the detection object by labeling several groups of different width and height values obtained by clustering the rectangular boxes on the training set, and the model is trained according to the constraint range.

The left backbone of the convolutional neural network is responsible for processing images to obtain feature mapping information on multiple scales. The information enters the feature fusion network on the right side, and the detection results are output through feature extraction and fusion on different scales. YOLOv3 uses a deeper network structure and adds a residual module to construct a more advanced Darknet-53 neural network skeleton.²¹ This network structure contributes to the deeper dissemination of information, and achieves higher processing speed while taking into account the recognition accuracy.

Improvement of YOLOv3 target detection algorithm

Clustering anchor by K-IGA algorithm

YOLOv3 uses a clustering algorithm to obtain the required anchor boxes.^18,21 However, the performance of the clustering algorithm depends on the selection of the initial centers, so a good clustering result is not easy to obtain, which reduces the accuracy of the anchor box selection. To optimize the performance of the clustering algorithm, this paper adopts the improved genetic algorithm (IGA).²² According to the fitness value of each individual, IGA divides the population into two sub-populations, one mutated with the Gaussian mutation operator and the other with the Cauchy mutation operator, to obtain both good local search ability and good global search ability. IGA is then used to find the best center points for the k-means algorithm; the combined method is named the K-IGA algorithm and is applied to anchor box selection. The flow chart of the algorithm is shown in Figure 2, and the procedure is as follows:

(1) The initial population of N individuals is generated from random numbers. Each individual $x_{i}$ encodes the coordinates of k cluster centers: $x_{i} = (c_{1}, c_{2}, \dots, c_{k})$ , $i = 1, 2, \dots, N$ , where $c_{j} = (c_{j 1}, c_{j 2}, \dots, c_{j n})$ , $j = 1, 2, \dots, k$ , represents the $j$ th cluster center. Before starting the operation, the optimal value replacement number $num = 0$ and the number of iterations $iter = 1$ are set, and the main parameters are created: the single activity area $range$ , the activity rate c, the selection parameter $Pa$ , the cross parameter $Pc$ , and the iteration termination number M.

(2) Let the data set Z contain h groups of data, $Z = (z_{1}, z_{2}, \dots, z_{h})$ , where $z_{l} = (z_{l 1}, z_{l 2}, \dots, z_{l n})$ , $l = 1, 2, \dots, h$ . According to the Euclidean distance, each group of data $z_{l}$ is assigned to the closest cluster center $c_{j}$ in $x_{i}$ , so that Z is divided into k classes. The sum of the distances between all data and their closest cluster centers is taken as the fitness function of $x_{i}$ , given by equation (1):

f (x_{i}) = \sum_{j = 1}^{k} \sum_{l = 1}^{h} \sqrt{\sum_{p = 1}^{n} {(z_{lp} - c_{jp})}^{2}}

(1)

Figure 2.

The flow chart of the algorithm process.

Here $c_{j}$ is the cluster center closest to $z_{l}$ . The fitness of the N initial individuals is calculated, and the individual with the minimum value is marked as $x_{c - best}$ .

(3) The population undergoes selection and crossover according to the selection parameter $Pa$ and the cross parameter $Pc$ . Individuals with lower fitness values have a higher probability of being selected. During crossover, individuals are randomly paired and exchange genes at randomly chosen points. The groups produced by the two operations are then merged.

(4) Perform the group mutation operation. Equation (2) calculates the proportional transformation function value $F (x_{i})$ of each new individual’s fitness. The individual undergoes Cauchy mutation when $F (x_{i})$ is greater than 0.5; otherwise it undergoes Gaussian mutation, as given by equation (3):

F (x_{i}) = \frac{f (x_{i}) - f_{m i n}}{f_{m a x} - f_{m i n}}

(2)

x_{i}^{'} = \begin{cases} x_{i} + range \cdot F (x_{i}) \cdot N_{i} (0, 1), & F (x_{i}) \leq 0.5 \\ x_{i} + range \cdot F (x_{i}) \cdot C_{i} (1, 0), & F (x_{i}) > 0.5 \end{cases}

(3)

where $f_{\min}$ and $f_{\max}$ are the lowest and highest individual fitness values in the current generation, $f (x_{i})$ is the fitness of $x_{i}$ , $x_{i}$ and $x_{i}'$ are the $i$ th individual before and after mutation, $F (x_{i})$ is the proportional transformation function value of $x_{i}$ , $N_{i} (0, 1)$ is a Gaussian-distributed random number, and $C_{i} (1, 0)$ is a Cauchy-distributed random number.

(5) The N individuals with the smallest fitness in the group after the previous step are extracted to complete the optimization mutation. The individual in the group is combined with the optimal individual $x_{opt}$ in the current round through equation (4). The new individual will be updated if its fitness is less than the original individual’s. Otherwise, the random update calculation is carried out according to equation (5) and whether the individual needs to be updated is judged again. Select the minimum fitness individual as $x_{opt}$ from the group after completing the above operation.

x_{i}' = (1 - c) \cdot x_{i} + c \cdot x_{opt}

(4)

x_{i}' = \begin{cases} x_{i} + (x_{i} - x_{\min}) \left(1 - rand^{\left(\frac{1 - iter}{M}\right)^{2}}\right), & rand \leq 0.5 \\ x_{i} + (x_{i} - x_{\max}) \left(1 - rand^{\left(\frac{1 - iter}{M}\right)^{2}}\right), & rand > 0.5 \end{cases}

(5)

Where $x_{i}$ and $x_{i}'$ are the ith individual before and after the optimization mutation ; $x_{opt}$ is the optimal individual of this generation ; $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of $x_{i}$ ; $rand$ is a random number of the interval $[0, 1]$ .

(6) Compare the fitness value of the old and new optimal individual, if $f (x_{c - best}) > f (x_{opt})$ , then $num = 0$ , $x_{c - best} = x_{opt}$ , otherwise $num = num + 1$ . When $num = 10$ , the group is randomly updated according to equation (5) and $num$ is returned to 0.

(7) Query the current number of iterations, if $iter < M$ , then $iter = iter + 1$ , jump to (3); on the contrary, stop the calculation operation and get the optimal $x_{c - best}$ .
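The core computations of steps (2), (4) and (5) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the fitness follows the reading of equation (1) in which each sample contributes only its distance to the closest center, one random number is drawn per coordinate, and the Cauchy number is taken from a standard Cauchy distribution.

```python
import math
import random


def fitness(centers, data):
    """Equation (1): sum of Euclidean distances from every sample to its nearest center."""
    return sum(min(math.dist(z, c) for c in centers) for z in data)


def scaled_fitness(f, f_min, f_max):
    """Equation (2): rescale a fitness value to [0, 1] within one generation."""
    if f_max == f_min:  # assumption: a generation with identical fitness maps to 0
        return 0.0
    return (f - f_min) / (f_max - f_min)


def cauchy_sample(rng):
    """Standard Cauchy random number obtained through the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def mutate(centers, F, search_range, rng):
    """Equation (3): Gaussian step when F <= 0.5, Cauchy step otherwise."""
    if F <= 0.5:
        draw = lambda: rng.gauss(0.0, 1.0)
    else:
        draw = lambda: cauchy_sample(rng)
    return [tuple(v + search_range * F * draw() for v in c) for c in centers]


def move_toward_best(centers, best, c):
    """Equation (4): blend an individual with the current optimum x_opt."""
    return [tuple((1 - c) * a + c * b for a, b in zip(ci, bi))
            for ci, bi in zip(centers, best)]


# Toy run: 6 individuals, each encoding k = 2 centers in the (width, height) plane.
rng = random.Random(0)
data = [(0.10, 0.12), (0.12, 0.10), (0.50, 0.55), (0.52, 0.50)]
population = [[(rng.random(), rng.random()) for _ in range(2)] for _ in range(6)]

scores = [fitness(ind, data) for ind in population]
f_min, f_max = min(scores), max(scores)
mutated = [mutate(ind, scaled_fitness(f, f_min, f_max), 0.1, rng)
           for ind, f in zip(population, scores)]

x_opt = min(population + mutated, key=lambda ind: fitness(ind, data))
print("best fitness after one mutation round:", round(fitness(x_opt, data), 4))
```

Individuals with a small scaled fitness are already close to a good solution, so they take small Gaussian steps; poor individuals take heavy-tailed Cauchy steps that can jump out of a local optimum.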

It can be seen that the fitness value corresponding to the optimal $x_{c - best}$ is the smallest and $x_{c - best}$ is the optimal value of the k cluster centers. The width and height of the anchor box are input into the K-IGA algorithm as two-dimensional data for clustering. The new anchor box value can be more in line with the actual characteristics of the data set, thereby improving the performance of the YOLOv3 model.

Loss function modification

In some data sets, small-sized target objects such as cyclists and pedestrians only occupy a small part of the image. The training model obtains little practical information from the corresponding pixels, and the conventional method makes it difficult to train better recognition performance. Modifying the loss function may be a better method to improve the detection accuracy.

In YOLOv3, the concept of loss function is used to characterize the deviation between the predicted and actual values of the network model, which can be represented by equation (6):

L = L_{location} + L_{confidence} + L_{class}

(6)

where $L_{location}$ is the regression loss of the bounding box position, $L_{confidence}$ is the target confidence loss, and $L_{class}$ is the target classification loss. The confidence loss can be quantified according to existence and accuracy. The equation of the confidence coefficient Conf is as follows:

Conf = P_{object} \times IOU

(7)

Where $P_{object}$ is the target object judgment index of the unit cell. The equation of Intersection over Union (IOU) is as follows:

IOU = \frac{| A \cap B |}{| A \cup B |}

(8)

L_{IOU} = 1 - IOU

(9)

where A is the box predicted by the model, B is the actual target box, and $L_{IOU}$ is the loss function for $IOU$ . This evaluation method has the advantages of symmetry, non-negativity and scale invariance.

This calculation method has apparent shortcomings in two aspects:

(1) When the two boxes do not overlap, $A \cap B = \emptyset$ , so $IOU = 0$ and $L_{IOU} = 1$ regardless of how far apart they are. The loss should grow with the distance between the two boxes, but it stays constant, so it does not reflect the actual situation and the network cannot continue to optimize.

(2) The value of $IOU$ reflects only the degree of overlap, not the mode of overlap, although the mode of overlap affects the network optimization performance.
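The first shortcoming is easy to reproduce numerically. The following sketch implements equations (8) and (9) for axis-aligned boxes; the `(x1, y1, x2, y2)` corner format, the function names and the example coordinates are assumptions chosen for illustration.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2); zero if the box is empty."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a, b):
    """Equation (8): intersection area divided by union area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = box_area((ix1, iy1, ix2, iy2))
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


truth = (0, 0, 10, 10)
near_miss = (11, 0, 21, 10)    # 1 pixel away from the ground truth
far_miss = (100, 0, 110, 10)   # 90 pixels away from the ground truth

for name, pred in [("near", near_miss), ("far", far_miss)]:
    print(name, "IOU =", iou(pred, truth), "L_IOU =", 1 - iou(pred, truth))
```

Both predictions receive the same loss of 1 even though one of them is almost correct, so the loss carries no information about which direction would improve the prediction.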

Because of these deficiencies, this network uses the $GIOU$ proposed by Rezatofighi et al.²³ The equation is as follows:

GIOU = IOU - \frac{| C \setminus (A \cup B) |}{| C |}

(10)

L_{GIOU} = 1 - GIOU

(11)

where C is the smallest rectangle enclosing both A and B, and $L_{GIOU}$ is the loss function of $GIOU$ .

Compared with the traditional evaluation index $IOU$ , $GIOU$ inherits the advantages of scale invariance and quantification of the degree of overlap, and it also distinguishes two pairs of boxes that have the same degree of overlap but different overlap modes. The gradient of the loss function can therefore be calculated effectively, which improves the performance of the objective function. When the data set contains many small-sized targets, more feature information is extracted, which improves the detection accuracy of the model.
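A small numerical sketch of equations (10) and (11) shows how $GIOU$ separates two non-overlapping predictions that plain $IOU$ cannot tell apart. The `(x1, y1, x2, y2)` box format, the function names and the example boxes are assumptions for illustration, and boxes are assumed to have positive area.

```python
def area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2); zero if the box is empty."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def giou(a, b):
    """Equation (10): IOU minus the share of the enclosing box C not covered by A or B."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    enclosing = area((min(a[0], b[0]), min(a[1], b[1]),
                      max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (enclosing - union) / enclosing


def giou_loss(a, b):
    """Equation (11)."""
    return 1.0 - giou(a, b)


truth = (0, 0, 10, 10)
for name, pred in [("near", (11, 0, 21, 10)), ("far", (100, 0, 110, 10))]:
    print(name, round(giou_loss(pred, truth), 3))
# near 1.048
# far 1.818
```

Neither prediction overlaps the ground truth, yet the distant one is penalized more, so the loss still decreases as a prediction moves toward the target.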

Modifying network structure

In the training process of convolutional neural networks, if the samples of different batches are unevenly distributed, the model consumes more time to adapt to the uneven sample distribution. The convolution module in YOLOv3 contains a BN layer, which is used to normalize the input data. This operation can solve the problem of uneven data distribution between layers, but it will increase the amount of computation during forward propagation and extend the target detection time. Combining the BN layer into the convolutional layer can improve the detection speed.

The convolution layer in the convolution module is as follows (12):

g_{i, j} = W_{conv} \times f_{i, j} + b_{conv}

(12)

Where $g_{i, j}$ is the output of the convolutional layer, $f_{i, j}$ is the feature map data, $W_{conv}$ is the convolutional matrix, and $b_{conv}$ is the offset term.

For the BN layer in the convolution module, the operation process is as follows (13):

B N_{out} = \frac{\gamma \times (g_{i, j} - x_{aver})}{\sqrt{Var}} + \beta

(13)

where $Var$ is the variance of the input data during training, $x_{aver}$ is the mean of the input data for one batch of the layer, $\gamma$ and $\beta$ are the learnable scale and shift parameters of the BN layer, and $B N_{out}$ is the output of the layer.

According to the structure of the convolution module, the output of the convolution layer is the input of the BN layer, so the two operations can be merged into:

W_{CB} = \frac{\gamma \times W_{conv}}{\sqrt{Var}}

(14)

b_{CB} = \frac{\gamma \times (b_{conv} - x_{aver})}{\sqrt{Var}} + \beta

(15)

F_{i, j} = B N_{out} = W_{CB} \times f_{i, j} + b_{CB}

(16)

Here $F_{i, j}$ is the output of the convolution module, and $W_{CB}$ and $b_{CB}$ are the new weight and bias terms. The derivation is a mathematically equivalent transformation, so the merged layer theoretically produces the same result as the convolution layer followed by the BN layer, while the forward inference speed is improved.
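The equivalence can be checked numerically. The sketch below uses a single scalar weight as a stand-in for one output channel of the convolution; the function names and parameter values are hypothetical. Practical implementations usually add a small constant to the variance before taking the square root, which equations (13) to (15) omit.

```python
import math


def conv_then_bn(f, w_conv, b_conv, gamma, beta, x_aver, var):
    """Equations (12) and (13) applied one after the other."""
    g = w_conv * f + b_conv
    return gamma * (g - x_aver) / math.sqrt(var) + beta


def fuse_conv_bn(w_conv, b_conv, gamma, beta, x_aver, var):
    """Equations (14) and (15): fold the BN statistics into the weight and bias."""
    scale = gamma / math.sqrt(var)
    w_cb = scale * w_conv
    b_cb = scale * (b_conv - x_aver) + beta
    return w_cb, b_cb


# Hypothetical per-channel parameters after training.
params = dict(w_conv=0.8, b_conv=0.1, gamma=1.5, beta=-0.2, x_aver=0.3, var=4.0)
w_cb, b_cb = fuse_conv_bn(**params)

for f in (-1.0, 0.0, 2.5):
    separate = conv_then_bn(f, **params)
    fused = w_cb * f + b_cb  # equation (16): one multiply-add per output value
    assert math.isclose(separate, fused, rel_tol=1e-12, abs_tol=1e-12)

print("fused weight:", w_cb, "fused bias:", b_cb)
```

Because the fused weight and bias are computed once after training, inference needs only the single multiply-add of equation (16) instead of a convolution followed by a separate normalization pass.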

Target detection model training and application

Data set preparation and computing environment setting

To verify the accuracy of the target detection algorithm, the authoritative KITTI data set is selected and batch processed,²⁴ yielding 4904 pictures of 608 × 608 pixels. Of these, 4100 pictures are randomly selected as the training set and the remaining 804 as the test set. The distribution of sample types is balanced as far as possible. Among the labeled targets, 15,016 are classified as “Car,” 7129 as “Pedestrian,” and 2886 as “Cyclist.”

The training and testing of the YOLO models are carried out on a self-built data processing platform. The platform uses the Compute Unified Device Architecture (CUDA) parallel computing architecture to support Graphics Processing Unit (GPU) acceleration and improve the efficiency of the algorithm.

During training, the batch_size is 64, the subdivision is 32, and the number of training iterations is 8000. The learning rate uses burn_in mode in the first 1000 iterations, is 10e–3 from 1000 to 6400 iterations, is reduced to 10e–4 from 6400 to 7200 iterations, and is reduced to 10e–5 from 7200 iterations to the end of training. The overlap threshold is set to 0.5.
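The learning rate schedule described above can be written as a short function. This is a sketch under assumptions: the base learning rate is left as a parameter, the polynomial warm-up with exponent 4 is an assumed form of the burn_in mode (the text does not give its formula), and the function name is hypothetical. The last two lines assume the Darknet convention in which each batch is split into `subdivisions` smaller groups that are passed through the network one at a time.

```python
def learning_rate(iteration, base_lr, burn_in=1000, steps=(6400, 7200),
                  scale=0.1, power=4):
    """Piecewise schedule: warm-up during burn-in, then two tenfold reductions."""
    if iteration < burn_in:
        # Assumption: polynomial warm-up; the source only names a "burn_in mode".
        return base_lr * (iteration / burn_in) ** power
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= scale
    return lr


base_lr = 1e-3  # assumed value for illustration
for it in (0, 500, 1000, 6400, 7200, 7999):
    print(it, learning_rate(it, base_lr))

batch_size, subdivisions = 64, 32
images_per_forward_pass = batch_size // subdivisions  # 2 images processed at once
```

Splitting a batch of 64 into 32 subdivisions keeps the gradient statistics of a large batch while only two images have to fit in GPU memory at the same time.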

Performance evaluation index of model

To comprehensively measure the performance of the model, the following indicators are used to evaluate the effectiveness of the model:

precision = \frac{TP}{TP + FP} = \frac{TP}{D}

(17)

where $TP$ is the number of true positive samples, $FP$ is the number of false positive samples, and $D$ is the number of all predicted results.

The recall rate expression is:

recall = \frac{TP}{TP + FN} = \frac{TP}{GT}

(18)

where $TP$ is the number of true positive samples, $FN$ is the number of false negative samples, and $GT$ is the number of all annotation results.

Average Precision (AP) is the evaluation index of the overall accuracy of the algorithm on a certain class. $mAP$ is the comprehensive accuracy of the model over all classes, which can be expressed as:

mAP = \frac{\sum AP}{N_{cls}} = \frac{\sum_{i = 1}^{N_{cls}} \int_{0}^{1} P_{i} (R) dR}{N_{cls}}

(19)

where $N_{cls}$ is the total number of detection categories and $P_{i} (R)$ is the precision-recall curve of the $i$ th category.

$mAP$ depends on the $IOU$ threshold: when the intersection over union of the prediction box and the real box is greater than the threshold, the detection is counted as a positive sample. The threshold is set to 0.7.
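Equations (17) to (19) translate directly into code. In the sketch below the function names, the detection counts and the sample precision-recall points are hypothetical, and the integral in equation (19) is approximated with the trapezoidal rule; benchmark toolkits often use interpolated variants instead. The last call uses the three per-class AP values reported for MD1 in Table 2.

```python
def precision(tp, fp):
    """Equation (17): fraction of predictions that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Equation (18): fraction of annotated targets that are found."""
    return tp / (tp + fn)


def average_precision(recalls, precisions):
    """Area under a precision-recall curve by the trapezoidal rule.

    `recalls` must be sorted in increasing order.
    """
    total = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        total += width * (precisions[i] + precisions[i - 1]) / 2
    return total


def mean_average_precision(ap_per_class):
    """Equation (19): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


print(precision(tp=80, fp=20))                                 # 0.8
print(round(recall(tp=80, fn=10), 4))                          # 0.8889
print(average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6]))
print(round(mean_average_precision([0.8765, 0.7865, 0.7629]), 4))  # 0.8086
```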

Frames Per Second(FPS) is the detection speed of the model for the test set and can be expressed as:

FPS = \frac{N_{test}}{T_{test}}

(20)

where $N_{test}$ is the number of test images and $T_{test}$ is the inference time.

Results and analysis of the target detection model

The different YOLOv3 improvement methods are compared on the same training set and test set, and the performance of target detection is analyzed according to the performance evaluation indices. Specifically, model MD1 uses YOLOv3's own cluster analysis to determine the anchor box. Model MD2 uses the anchor box generated by K-IGA clustering. Model MD3 merges the BN layer in the convolution module into the convolution layer on the basis of MD2, and MD4 changes the loss function to GIOU on the basis of MD3.

Clustering anchor boxes by K-IGA algorithm

By default, the YOLOv3 model uses nine groups of anchors for learning on three feature scales. The values of these nine groups are derived from clustering the actual boxes of the COCO data set. To improve the selection of anchor values, this test uses the K-IGA algorithm to cluster the actual boxes of the self-made training set. The widths and heights in all annotation files of the training set are normalized and collected into a table of two-dimensional width and height information. The K-IGA algorithm then clusters these data, and the cluster centers are the anchor values of the data set. Figure 3 shows the clustering result.

Figure 3.

Anchor clustering results after normalization.

The array obtained by K-IGA algorithm clustering is the width and height of the normalized annotation box, which needs to be denormalized. The Anchor value obtained is shown in Table 1.

Table 1.

Anchor values of different models.

Name	Anchor value
Model C9	(15,13) (10,28) (31,20) (21,50) (55,30) (41,84) (91,49) (123,85) (184,112)
Model A1	(10,13) (16,30) (33,23) (30,61) (62,45) (59,119) (116,90) (156,198) (373,326)
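The normalization before clustering and the denormalization afterwards can be sketched as follows. The sketch assumes that widths and heights are divided by the 608 × 608 image size, that the cluster centers are rounded to integer pixels as in Table 1, and that anchors are ordered by area; the function names and the sample values are hypothetical.

```python
def normalize_boxes(boxes_px, image_size):
    """Scale (width, height) pairs in pixels to the unit square."""
    w_img, h_img = image_size
    return [(w / w_img, h / h_img) for w, h in boxes_px]


def denormalize_anchors(centers, input_size):
    """Map normalized cluster centers back to integer pixel anchors, sorted by area."""
    w_in, h_in = input_size
    anchors = [(round(w * w_in), round(h * h_in)) for w, h in centers]
    return sorted(anchors, key=lambda a: a[0] * a[1])


size = (608, 608)
normalized = normalize_boxes([(152, 76), (304, 456)], size)
print(normalized)                                            # [(0.25, 0.125), (0.5, 0.75)]
print(denormalize_anchors([(0.5, 0.75), (0.25, 0.125)], size))  # [(152, 76), (304, 456)]
```

Clustering in normalized coordinates keeps the result independent of the image resolution; the centers are converted back to pixels only when they are written into the model configuration.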

Different methods improve the performance of YOLOv3 model

The training results of YOLOv3 with the different improvement methods and of the classical YOLOv5 are shown in Figure 4. As the number of iterations increases, the accuracy of each model gradually increases and the loss function value decreases. The loss functions of models MD1 and MD2 stabilize after around 2000 iterations, that of MD3 after around 4000 iterations, and that of MD4 after around 6000 iterations. The precision-recall curves show that, for the four YOLOv3 models and the classical YOLOv5 model, the accuracy for cars is the highest, while the accuracy for cyclists and pedestrians is relatively low. The precision-recall curve of MD2 is better than that of MD1, indicating that the anchor boxes obtained by the K-IGA algorithm are better than those determined by the clustering algorithm in YOLOv3. The modified network structure of MD3 improves its speed, but its precision-recall curve is slightly worse than that of MD2. Compared with MD4, classical YOLOv5 detects cars and cyclists slightly more accurately, but its accuracy for pedestrians is lower than that of MD4.

Figure 4.

Performance curves of the training models. (a) area average IOU curve of MD1. (b) Average loss curve of MD1 (c) Precision-recall curve of MD1. (d) area average IOU curve of MD2. (e) Average loss curve of MD2. (f) Precision-recall curve of MD2. (g) area average IOU curve of MD3. (h) Average loss curve of MD3 (i) Precision-recall curve of MD3. (j) area average IOU curve of MD4. (k) Average loss curve of MD4. (l) Precision-recall curve of MD4. (m) Precision-recall curve of YOLOv5 Model. (n) Box_loss in training. (o) Obj_loss in training. (p) Cls_loss in training.

Table 2 shows the performance evaluation indices of the models. The mAP of MD1 is 0.8086, with an AP of 0.8765 for cars, 0.7865 for cyclists and 0.7629 for pedestrians. The performance of model MD2 is greatly improved: its mAP increases to 0.8544, the AP of cyclists increases to 0.8723, and the AP of pedestrians increases to 0.7945. However, the FPS of MD2 is lower than that of MD1, a reduction of 16.7%. The mAP of model MD3 is close to that of MD2, and because MD3 merges the BN layer into the convolution layer in the convolution module, its FPS is improved. Compared with MD3, MD4 shows a noticeable improvement in mAP, especially for pedestrians, while the FPS remains nearly the same. Compared with MD2, MD4 shows a significant improvement in the AP of cars and pedestrians, but the AP of cyclists is slightly reduced. Compared with MD1, every index of MD4 is significantly improved: the mAP increases by 7.8%, the AP of cars by 4.8%, the AP of cyclists by 9.3%, the AP of pedestrians by 10.6%, and the FPS by 4.1%. The mAP and AP of the classical YOLOv5 model are generally better than those of MD1, MD2 and MD3. Its mAP is 0.8665, which is slightly lower than that of MD4, indicating that its detection accuracy is slightly lower than that of MD4. Its AP for cars and cyclists is slightly better than that of MD4, but its AP for pedestrians is much lower, indicating that its detection performance for small targets is relatively poor. Overall, MD4 shows excellent performance.

Table 2.

AP, mAP, and FPS of different models.

Name	AP of car	AP of cyclist	AP of pedestrian	mAP	FPS/Hz
MD1	0.8765	0.7865	0.7629	0.8086	14.42
MD2	0.8964	0.8723	0.7945	0.8544	12.36
MD3	0.9063	0.8560	0.7927	0.8517	15.13
MD4	0.9182	0.8598	0.8441	0.8713	15.01
YOLOv5 Model	0.928	0.864	0.810	0.8665	/

Therefore, using the appropriate anchor box, improving the model structure and modifying the loss function can improve the average accuracy of the model.
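The relative gains of MD4 over MD1 follow directly from the MD1 and MD4 rows of Table 2. The short check below computes each gain as the difference divided by the MD1 value; the dictionary layout and the function name are choices made for this example.

```python
table2 = {
    "MD1": {"car": 0.8765, "cyclist": 0.7865, "pedestrian": 0.7629,
            "mAP": 0.8086, "FPS": 14.42},
    "MD4": {"car": 0.9182, "cyclist": 0.8598, "pedestrian": 0.8441,
            "mAP": 0.8713, "FPS": 15.01},
}


def relative_gain(new, old):
    """Percentage change of `new` with respect to `old`."""
    return 100.0 * (new - old) / old


gains = {key: round(relative_gain(table2["MD4"][key], table2["MD1"][key]), 1)
         for key in table2["MD1"]}
print(gains)
# {'car': 4.8, 'cyclist': 9.3, 'pedestrian': 10.6, 'mAP': 7.8, 'FPS': 4.1}
```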

Target detection effect verification

To verify the recognition effectiveness of the target detection model, typical scenes from the test set are processed by the MD4 model. Figure 5 shows the scene detection results. The model can detect three types of targets (vehicles, pedestrians and cyclists) in complex traffic scenes and output the target classification information. The detection model can accurately extract the target objects in an information-rich environment, and it recognizes distant, occluded and partially visible targets well.

Figure 5.

Test effect of actual scene.

Conclusions and prospects

YOLOv3 is improved with three methods and obtains better recognition performance. The data processing platform is constructed based on YOLOv3, and the anchor box parameters are determined by the clustering analysis of YOLOv3, thus forming MD1 as the benchmark model. The double population genetic algorithm improves the clustering analysis, and the K-IGA algorithm is obtained. The value of the anchor box is optimized by the K-IGA algorithm, and the MD2 model is constructed. On the basis of MD2, the BN layer is merged into the convolutional layer to obtain the MD3 model. Based on MD3, the GIOU idea is used to improve the calculation method of the loss function, and the MD4 model is obtained.

A total of 4904 images containing vehicles, pedestrians and cyclists are produced as the data set for model training and testing. The results show that the mAP and AP of MD2 are better than those of MD1, but the FPS is reduced by 16.7%. MD3 improves the FPS with almost no decrease in mAP. The overall performance of MD4 is better than that of MD2 and MD3. Compared with MD1, the mAP of MD4 is improved by 7.8%, the AP of cars by 4.8%, the AP of cyclists by 9.3%, the AP of pedestrians by 10.6%, and the FPS by 4.1%. Considering both detection accuracy and speed, MD4 obtains the best target detection performance. Its overall performance is also slightly better than that of classical YOLOv5.

Although the improved YOLOv3 model has shown promising results, some areas remain worthy of further research and development.

The perception of the actual road environment should not be limited to vehicles, cyclists and pedestrians. Traffic lights, signal boards, and other types of obstacles can be further added to identify targets, especially the identification of occluded obstacles, to improve the universality of the target detection model. In the process of optimizing the structure of convolutional neural network, the interaction between different structures and anchor box parameters on detection accuracy can be further studied.

Footnotes

ORCID iD

Wei Yan

Ethical Considerations

This work did not involve humans or animals. Ethical approval was not required for this research.

Consent to participate

There is no such case.

Consent for publication

The corresponding author gave consent for the publication of the identifiable details.

Author contributions

Concept and design: Wei Yan and Jiashu Ji; data collection and analysis: Jiashu Ji and Lingzhi Xu; drafting of the article: Jiashu Ji, Wei Yan and Fangzhe Sun; critical revision of the article for important intellectual content: Wei Yan and Jiashu Ji. All the authors approved the final article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the Key Research and Development Program of Shandong Province, China (Grant No. 2023CXGC010210 and Grant No. 2020CXGC011004).

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

Data will be made available on request.
