Pedestrian detection based on improved LeNet-5 convolutional neural network

Abstract

To meet the real-time and accuracy requirements of pedestrian detection in advanced driver assistance systems, this article proposes an improved LeNet-5 convolutional neural network. First, the structure of the LeNet-5 network model is analyzed; the structure and parameters of the network are then improved and optimized to obtain a new LeNet network model, which is used to detect pedestrians. Comparative experiments show that the improved LeNet convolutional neural network reaches a miss rate of 25%. The experiments indicate that the method trains and detects faster than SA-Fast R-CNN and the classical LeNet-5 CNN, and has a lower miss rate than the classical LeNet-5 CNN.

Keywords

Pedestrian detection, feature extraction, LeNet-5, convolutional neural network

Introduction

In computer vision, pedestrian detection technology locates pedestrians accurately and is therefore widely used in advanced driver assistance for intelligent vehicles, intelligent robots, human behavior analysis, intelligent video surveillance, and other fields. According to statistics, more than 430,000 people are injured and more than 39,000 people are killed in traffic accidents worldwide each year.¹ With continuing urbanization, urban and rural road environments have become more complicated, and the safety of road users has become harder to guarantee. Therefore, a number of car manufacturers, research institutions, and universities have begun in-depth research on pedestrian detection. Outside China, institutions working on pedestrian detection systems include the Daimler Chrysler R&D Center, Carnegie Mellon University (CMU), Volkswagen in Germany, and the Toyota Motor Research Center in Japan;² within China, they include Jilin University, Tsinghua University, Shanghai Jiaotong University, the Institute of Automation of the Chinese Academy of Sciences, Xi'an Jiaotong University, Zhejiang University, and the University of Science and Technology of China.³

Since 2002, researchers have studied simple pedestrian features and classification algorithms. Dalal and Triggs proposed the histogram of oriented gradients (HOG) algorithm in 2005,⁴ whose features improved pedestrian detection to a certain extent. Subsequently, pedestrian detection technology became practical, and datasets began to grow. In 2008, Pedro Felzenszwalb and others took the variability of pedestrians into account and extended the HOG features to further improve pedestrian detection performance.⁵ In 2013, researchers at the Chinese University of Hong Kong⁶ used a convolutional neural network (CNN) to combine an occlusion model with a deformation model, extracting pedestrian features with the CNN and achieving good results on occluded pedestrians. From 2015, researchers began to experiment with different network structures to analyze the impact of the number and size of convolution kernels on pedestrian detection performance. Through these experiments, detection accuracy and real-time performance have improved.

Pedestrian detection algorithms can be divided into two categories. The first is based on background modeling: the moving foreground target is extracted, features are computed in the target area, and a classifier then decides whether the area contains a pedestrian. Background modeling has several problems: dense objects are difficult to detect, illumination changes alter the image colors, and the background-difference algorithm can produce false detections in areas covered by moving objects. The second category is based on statistical learning, which is the most commonly used approach to pedestrian detection. A pedestrian detection classifier is constructed from a large number of samples. The extracted features mainly include the target's gray scale, edges, texture, color, and gradient histogram. Classifiers mainly include neural networks, support vector machines (SVM), AdaBoost, and deep learning. However, statistical learning also has difficulties.⁷ For example, pedestrians differ in size, posture, and clothing, and appear against complex backgrounds. The performance of the classifier depends strongly on the training samples, and the negative samples used in off-line training cannot cover all real application scenarios.

Most current pedestrian detection builds on the HOG+SVM algorithm proposed by Dalal.⁸ The classical LeNet-5 network can also detect pedestrians and achieves a higher detection rate than HOG+SVM, but it is slow. To solve this problem, this paper builds on the LeNet-5 CNN, which automatically learns the texture and gradient features of pedestrian targets, and introduces a new nine-layer LeNet CNN for pedestrian detection. Exploiting the strong learning ability and robustness of the new LeNet CNN, a pedestrian detection sample database and a CNN training and detection model were established, and the network was trained on a large number of pedestrian samples to obtain deep sample features. The result is a pedestrian detection system with high detection speed, high accuracy, and strong scene adaptability. With this system, vehicles can autonomously detect the surrounding environment in real time. Accurate detection results give the driver more reliable information and help the driver respond immediately to hidden dangers in the surrounding environment, so as to protect pedestrians.

Convolutional neural network pedestrian detection framework

The combination of HOG features and SVM classification can detect pedestrians against complex backgrounds, is robust to illumination changes, and has a high detection rate. However, HOG feature extraction is time-consuming, so its real-time performance is relatively weak. Although the classical LeNet-5 CNN has a higher detection rate than the HOG+SVM method, it spends a large amount of time on training samples, and its real-time performance is also weak. Therefore, a new pedestrian detection method based on the LeNet-5 CNN structure is proposed, which detects pedestrians effectively and improves both real-time performance and accuracy.

The whole detection process consists of the following parts: first, the Caltech pedestrian detection training and test sample database is established; then, the video frames are pre-processed, the network parameters are adjusted, and the model is trained; finally, pedestrians are detected. The entire detection process is shown in Figure 1.

Figure 1.

Convolutional neural network framework.

Improved LeNet-5 CNN pedestrian detection algorithm

In this paper, the classical LeNet-5 CNN⁹ is improved by adding layers, which ensures a high detection rate while maintaining speed. Improvements are made in three aspects: first, a batch normalization (BN) layer is added after each convolution layer; second, a dynamic adaptive pooling model is adopted; third, an improved ReLU activation function is used to alleviate the gradient vanishing problem. In addition, the final classification layer uses an SVM. With these changes, both the training speed and the detection rate improve.

1. Addition of normalized layer BN¹⁰

The convolution layers are trained mainly with gradient descent. Network parameters such as the learning rate must be set manually before training. This is cumbersome, makes the detection rate depend on human choices, and cannot guarantee real-time requirements. Adding a batch normalization (BN) layer addresses this problem: it allows a higher learning rate and accelerates convergence.

The input data are normalized before being sent to the next layer, so that the next layer does not have to adapt to a new data distribution at every step. The normalization layer is a learnable, parameterized network layer. Ideally the data would be preprocessed by whitening, but full whitening is computationally expensive. To simplify the calculation, the following approximate whitening is used¹¹

{\overset{⌢}{x}}^{(k)} = \frac{x^{(k)} - E [x^{(k)}]}{\sqrt{Var [x^{(k)}]}}

(1)

where $E [x^{(k)}]$ is the average value of each batch of input data, and the denominator in equation (1) is the standard deviation of each batch of data.

Directly normalizing the data with equation (1) reduces the expressive power of the layer. For example, if the normalized data are concentrated in a narrow interval such as 0 to 1, the sigmoid function operates almost linearly there, which greatly reduces the expressive ability of the model. Therefore, two learnable parameters $γ, β$ are introduced, and the final output is as follows

y^{(k)} = γ^{(k)} {\overset{⌢}{x}}^{(k)} + β^{(k)}

(2)

Each neuron has its own pair of parameters $γ, β$. When

γ^{(k)} = \sqrt{Var [x^{(k)}]}

(3)

β^{(k)} = E [x^{(k)}]

(4)

the original activations of the layer are completely restored. The entire BN algorithm is as follows

μ_{B} \leftarrow \frac{1}{m} \sum_{i = 1}^{m} x_{i}

(5)

δ_{B}^{2} \leftarrow \frac{1}{m} \sum_{i = 1}^{m} (x_{i} - μ_{B})^{2}

(6)

{\overset{⌢}{x}}_{i} \leftarrow \frac{x_{i} - μ_{B}}{\sqrt{δ_{B}^{2} + \epsilon}}

(7)

y_{i} \leftarrow γ {\overset{⌢}{x}}_{i} + β \equiv B N_{γ, β} (x_{i})

(8)

The input data are $x_{1}, x_{2}, \dots, x_{m}$, where $m$ is the number of samples in the batch. $μ_{B}$ is the batch mean, $δ_{B}^{2}$ is the batch variance, $\epsilon$ is a small constant that keeps the denominator away from zero, and $y_{i} = B N_{γ, β} (x_{i})$ is the output. This is the entire BN algorithm process.
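Equations (5) to (8) can be sketched in a few lines of plain Python. This is only an illustrative sketch for a batch of scalar activations; the function name `batch_norm_train` and the default `eps` value are assumptions, not part of the original method description.

```python
import math

def batch_norm_train(xs, gamma, beta, eps=1e-5):
    """Normalize one mini-batch of scalar activations (equations (5)-(8))."""
    m = len(xs)
    mean = sum(xs) / m                                   # eq. (5): batch mean
    var = sum((x - mean) ** 2 for x in xs) / m           # eq. (6): biased batch variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in xs]  # eq. (7): normalize
    ys = [gamma * xh + beta for xh in x_hat]             # eq. (8): scale and shift
    return ys, mean, var

batch = [1.0, 2.0, 3.0, 4.0]
ys, mean, var = batch_norm_train(batch, gamma=1.0, beta=0.0)

# With gamma = sqrt(var + eps) and beta = mean the layer reproduces its input,
# which is the situation described by equations (3) and (4).
restored, _, _ = batch_norm_train(batch, gamma=math.sqrt(var + 1e-5), beta=mean)
```

With `gamma=1` and `beta=0` the outputs have zero mean; the second call illustrates why the learnable pair $γ, β$ lets the layer undo the normalization when that is useful.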

After training, the network is evaluated on test data, and BN then uses the following transform

y = \frac{γ}{\sqrt{Var [x] + \epsilon}} \cdot x + (β - \frac{γ E [x]}{\sqrt{Var [x] + \epsilon}})

(9)

Because this is the test process, the mean here is not the mean of each batch but the mean of the entire dataset. During training, the mean and variance of each batch are therefore recorded, so that the mean and variance of the entire dataset can be computed after training

E [x] \leftarrow E_{B} [μ_{B}]

(10)

Var [x] \leftarrow \frac{m}{m - 1} E_{B} [δ_{B}^{2}]

(11)
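The test-time computation in equations (9) to (11) can be sketched as follows. The helper names `population_stats` and `batch_norm_infer` and the example batch statistics are hypothetical and serve only to illustrate the formulas.

```python
import math

def population_stats(batch_means, batch_vars, m):
    """Combine recorded per-batch statistics as in equations (10) and (11)."""
    mean = sum(batch_means) / len(batch_means)                 # E[x] = E_B[mu_B]
    var = (m / (m - 1)) * sum(batch_vars) / len(batch_vars)    # unbiased variance estimate
    return mean, var

def batch_norm_infer(x, gamma, beta, mean, var, eps=1e-5):
    """Apply the fixed linear transform of equation (9) to one activation."""
    scale = gamma / math.sqrt(var + eps)
    return scale * x + (beta - scale * mean)

# Hypothetical statistics recorded from two training batches of size m = 4
pop_mean, pop_var = population_stats([2.0, 3.0], [1.0, 2.0], m=4)
y = batch_norm_infer(2.5, gamma=1.0, beta=0.0, mean=pop_mean, var=pop_var)
```

Because the statistics are fixed after training, equation (9) is a single multiply and add per activation, so BN adds very little cost at detection time.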

2. Improvement of the sampling layer

The purpose of sampling is secondary feature extraction, in which pooling is the most important step. The feature map obtained after pooling not only has lower dimension and resolution than the original feature map, but also helps to avoid overfitting. Common pooling methods include mean-pooling and max-pooling.¹² Both discard part of the information in the pooling region, which damages the representation of global features and the accuracy of the model.

To address these problems, this paper improves the pooling model on the basis of the maximum pooling algorithm. The improved model, called the dynamic adaptive pooling model, adjusts its pooling process dynamically for different feature maps and adapts the pooling weight to the content of each pooling region. If the pooling region contains only one value, that value is both the maximum and the representation of the region's features. If all values in the pooling region are equal, the maximum also represents the features of the region. Therefore, on the basis of the maximum pooling algorithm, a mathematical model is constructed by interpolation to approximate this behavior. With $μ$ denoting the pooling factor, the improved pooling model is given by equation (12)

S_{ij} = μ \max_{i = 1, j = 1} (F_{ij}) + b_{2}

(12)

The pooling factor $μ$ scales the result of the maximum pooling algorithm so that the pooled value represents the region more accurately. The remaining parameters follow the settings of the maximum pooling model

μ = ρ \frac{a (v_{\max} - a)}{v_{\max}^{2}} + θ

(13)

In equation (13), $a$ is the average of the elements of the pooling region excluding the maximum, $v_{\max}$ is the maximum element of the pooling region, $θ$ is the alignment error term, and $ρ$ is the feature coefficient, which is calculated as in equation (14)

ρ = \frac{c}{1 + (n_{epo} - 1) c^{n_{epo}^{2} + 1}}

(14)

where $n_{epo}$ is the number of training iterations. In equations (12) to (14), the feature coefficient $ρ$ depends on the side length $c$ of the pooling region and on the number of iterations $n_{epo}$, and the feature coefficient together with the values in the pooling region determines the pooling factor $μ$. When the size of the pooling region and the iteration are fixed, the pooling factor adapts to the content of each pooling region. For the same pooling region, the pooling factor is adjusted dynamically across iterations. Because $μ \in (0, 1)$, the model takes both maximum and average pooling into account and weakens the influence of the maximum value when other pooling regions are processed. Therefore, the CNN can extract more accurate features when processing different pooling regions under different iterations.
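A minimal sketch of equations (12) to (14) for a single pooling region is given below. It assumes that the denominator of equation (13) is $v_{\max}^{2}$ and that $b_{2}$ is an additive bias; the function names, the default values of `theta` and `b2`, and the fallback for degenerate regions are illustrative assumptions rather than details given in the text.

```python
def feature_coefficient(c, n_epo):
    """Feature coefficient rho, transcribed from equation (14)."""
    return c / (1 + (n_epo - 1) * c ** (n_epo ** 2 + 1))

def pooling_factor(region, c, n_epo, theta=0.0):
    """Pooling factor mu of equation (13) for one flattened pooling region."""
    v_max = max(region)
    rest = list(region)
    rest.remove(v_max)                  # drop one occurrence of the maximum
    if not rest or v_max == 0:
        return theta                    # assumption: degenerate regions fall back to theta
    a = sum(rest) / len(rest)           # mean of the non-maximum elements
    return feature_coefficient(c, n_epo) * a * (v_max - a) / v_max ** 2 + theta

def dynamic_adaptive_pool(region, c, n_epo, theta=0.0, b2=0.0):
    """Pooled value S of equation (12): scaled maximum plus bias."""
    return pooling_factor(region, c, n_epo, theta) * max(region) + b2

region = [1.0, 2.0, 3.0, 4.0]           # one 2 x 2 pooling region, flattened
mu = pooling_factor(region, c=2, n_epo=1)
s = dynamic_adaptive_pool(region, c=2, n_epo=1)
```

For this region, $a = 2$ and $v_{\max} = 4$, so the sketch gives $μ = 0.5$ and a pooled value of 2.0 at the first iteration, which lies between the mean of the non-maximum elements and the maximum.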

Dynamic adaptive pooling builds on maximum pooling. The input of the maximum pooling model is a two-dimensional matrix, such as the input feature map in Figure 2(a). As also shown in Figure 2(a), four different convolution kernels are used, with weight matrices $A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$, $B = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$, $C = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}$, and $D = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}$. Each kernel is convolved with the input feature map, which yields the four values of every pooling region as four convolution results. We concatenate these results into a three-dimensional matrix $T_{1}$ of size 4 × 7 × 7 and take the maximum along the first dimension to obtain a three-dimensional matrix $T_{2}$ of size 1 × 7 × 7. Transposing $T_{2}$ gives a 7 × 7 × 1 matrix, the maximum-value convolution result, to which the delete operation in Figure 2(b) is applied. The final result is the sub-sampling feature map of the maximum pooling model.

Figure 2.

The process of realizing pooling.

For dynamic adaptive pooling, the three-dimensional matrix $T_{1}$ obtained in the maximum pooling procedure is additionally summed along its first dimension to obtain a three-dimensional matrix $T_{3}$ of size 1 × 7 × 7. $T_{2}$ is subtracted from $T_{3}$, and the difference is divided by the number of remaining elements. The elements of the resulting matrix are the average value $a$ in equation (13), and the improved sampling feature map can be obtained by further calculation.
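The matrix procedure described above can be sketched in plain Python as follows. The sketch assumes that the delete operation of Figure 2(b) keeps every second row and column so that the pooling regions do not overlap; the function names and the 4 × 4 example input are hypothetical.

```python
def shifted_views(fmap):
    """Each one-hot 2x2 kernel A, B, C, D picks one corner of every 2x2 window."""
    h, w = len(fmap) - 1, len(fmap[0]) - 1
    offsets = [(0, 0), (0, 1), (1, 0), (1, 1)]       # corners selected by A, B, C, D
    return [[[fmap[i + di][j + dj] for j in range(w)] for i in range(h)]
            for di, dj in offsets]

def pool_statistics(fmap):
    """Return (T2, a): window maxima and the mean of the non-maximum elements."""
    t1 = shifted_views(fmap)                          # shape 4 x (H-1) x (W-1)
    h, w = len(t1[0]), len(t1[0][0])
    t2 = [[max(v[i][j] for v in t1) for j in range(w)] for i in range(h)]
    t3 = [[sum(v[i][j] for v in t1) for j in range(w)] for i in range(h)]
    a = [[(t3[i][j] - t2[i][j]) / 3 for j in range(w)] for i in range(h)]
    return t2, a

def subsample(mat, stride=2):
    """Keep every `stride`-th row and column (assumed form of the delete step)."""
    return [row[::stride] for row in mat[::stride]]

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
t2, a = pool_statistics(fmap)
max_pooled = subsample(t2)      # ordinary 2 x 2 max pooling with stride 2
mean_rest = subsample(a)        # value of a in equation (13) for each region
```

For an 8 × 8 input the same functions produce the 7 × 7 intermediate matrices mentioned in the text; the small input here only keeps the example easy to check by hand.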

3. Improvement of the ReLU activation function algorithm

The most commonly used activation functions in CNNs are the sigmoid and ReLU functions.¹³ The gradient of the sigmoid function vanishes during backpropagation, which greatly reduces the training speed. The ReLU activation function effectively alleviates the gradient vanishing problem; it allows deep neural networks to be trained in a supervised way without unsupervised layer-by-layer pre-training, which significantly improves their performance. However, ReLU also has serious drawbacks. First, the output of the ReLU function is prone to mean shift:¹⁴ neurons in the next layer receive a nonzero-mean signal from the previous layer, which makes the network parameters harder to compute. Second, as training progresses, part of the input falls into the hard saturation region of the ReLU function, so that the corresponding weights can no longer be updated (neuron death), as shown in Figure 3. Mean shift and neuron death jointly affect whether and how fast deep neural networks converge. We therefore improve the ReLU function. The improved ReLU function not only alleviates the gradient vanishing problem, but also avoids mean shift.

Figure 3.

ReLU activation function.

The part of ReLU function $x < 0$ is replaced by tanh function, and a new activation function is constructed, which is defined as in equation (15)

f (x) = \begin{cases} x, & x \geq 0 \\ α \tanh (x), & \text{otherwise} \end{cases}

(15)

The function is plotted in Figure 4, where the shape of the curve for $x < 0$ changes with the coefficient $α$. As the expression and the plot show, the improved ReLU function keeps the advantage of ReLU in the linear part on the right side. On the left side, the gradient $α (1 - \tanh^{2} (x))$ becomes small for large negative inputs but is never exactly 0, which effectively alleviates the gradient vanishing problem.

Figure 4.

Improved ReLU activation function.

On comparing the improved ReLU function with the original ReLU function, the nonlinear part ( $x < 0$ ) on the left can not only make the mean value closer to 0, but also avoid the mean shift. Moreover, the improved ReLU function does not cause neuronal death because the left part does not have the property of hard saturation.
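Equation (15) and its derivative can be written directly in Python. The function names and the value `alpha=0.5` used in the example are illustrative assumptions; the text does not state which value of $α$ was used in the experiments.

```python
import math

def improved_relu(x, alpha=1.0):
    """Equation (15): identity for x >= 0, alpha * tanh(x) otherwise."""
    return x if x >= 0 else alpha * math.tanh(x)

def improved_relu_grad(x, alpha=1.0):
    """Derivative of equation (15); alpha * (1 - tanh(x)^2) on the negative side."""
    return 1.0 if x >= 0 else alpha * (1.0 - math.tanh(x) ** 2)

inputs = (-2.0, -0.5, 0.0, 1.5)
values = [improved_relu(x, alpha=0.5) for x in inputs]
grads = [improved_relu_grad(x, alpha=0.5) for x in inputs]
```

Negative inputs produce small negative outputs, which pulls the mean activation toward zero, and every gradient in `grads` is strictly positive, so no input falls into a region where the weights stop updating entirely.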

4. Replacement of Softmax classifier by SVM classifier

The final output layer uses an SVM classifier.¹⁵ An SVM minimizes the classification risk under the principle of global separation of the samples in feature space. The Softmax classifier, by contrast, judges pedestrians by probability estimation, and its parameters are updated continuously through forward and backward propagation during network training. The SVM shortens the detection time and does not need network iteration, and its classification accuracy is clearly higher than that of the Softmax classifier.¹⁶

5. Improved model structure. The classical LeNet-5 CNN structure was deepened to nine layers. The parameters of the improved LeNet-5 CNN are listed in Table 1, and the structural changes are as follows (Figure 5):

1. C1 convolution layer. This layer has six feature maps. The convolution is performed by the filters, and the BN algorithm is then applied to each feature map.

2. S1 pooling layer. The maximum pooling algorithm is adopted. The pooling window is 2*2 with a sliding step of 2, and the image size after pooling is 14*14.

3. C2 convolution layer. The feature maps of this layer are combined with a BN layer: the convolution is performed by the filters, and the BN algorithm then normalizes the feature maps.

4. S2 pooling layer. The maximum pooling algorithm is adopted. The pooling window is 2*2 with a sliding step of 2, and the image size after pooling is 5*5.
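The feature-map sizes quoted in the list can be checked with a short calculation. The sketch below assumes a 32 × 32 single-channel input, as in the classical LeNet-5, and unpadded convolutions; both are assumptions, since the text does not state the input size explicitly.

```python
def conv_out(size, kernel, stride=1):
    """Output side length of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1

size = 32                       # assumed input side length (classical LeNet-5)
sizes = {}
for name, kernel, stride in [("C1", 5, 1), ("S1", 2, 2), ("C2", 5, 1), ("S2", 2, 2)]:
    size = conv_out(size, kernel, stride)
    sizes[name] = size

# C1: six 5 x 5 kernels on a single-channel input, one bias per kernel
c1_params = 6 * (5 * 5 * 1 + 1)
c1_connections = c1_params * sizes["C1"] ** 2
```

Under these assumptions the side lengths are 28, 14, 10, and 5 for C1, S1, C2, and S2, and C1 has 156 trainable parameters and 122,304 connections, in agreement with the first row of Table 1.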

Figure 5.

Improved LeNet-5 CNN model structure.

Multistrategy fusion window selection

As shown in Figure 6, we combine selective search for multiscale selection windows (SEL) with binary normalized gradients (BING) to obtain preferred regions of high quality.

Figure 6.

Multistrategy fusion window selection.

In the preparation phase, we first train the BING model on the training set of the INRIA pedestrian database, using a linear support vector machine. The positive examples are the normalized gradient features of the ground-truth windows, and the negative examples are the normalized gradient features of randomly selected background windows.

Candidate regions are then extracted with the selective search algorithm. To obtain candidate regions with higher quality and recall, we use the quality mode of selective search instead of the fast mode.

The candidate regions proposed by selective search are filtered using the aspect ratio and resolution characteristics of pedestrians. This prior knowledge removes many useless windows, which both speeds up the subsequent algorithm and eliminates a large part of the potential false detections.

The remaining candidate regions are filtered with the trained BING model. The previous steps do not consider the edge features of the object, so we use BING for further filtering. Edge detection operators combined with prior knowledge do not separate objects from background very well and are time-consuming, whereas BING is simple, fast, and effective, which is why we choose it.

After these three filtering steps, we obtain a smaller set of higher-quality candidate regions.
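The aspect-ratio and resolution filter in the second step can be illustrated with a short sketch. The thresholds `min_height`, `min_ratio`, and `max_ratio` below are hypothetical values chosen for illustration; the text does not report the values actually used.

```python
def filter_by_shape(boxes, min_height=50, min_ratio=1.5, max_ratio=4.0):
    """Keep (x, y, w, h) boxes whose height and height/width ratio look pedestrian-like."""
    kept = []
    for x, y, w, h in boxes:
        if w <= 0 or h < min_height:
            continue                      # degenerate or too small to classify
        if min_ratio <= h / w <= max_ratio:
            kept.append((x, y, w, h))
    return kept

candidates = [(10, 20, 40, 100),   # upright window, plausible pedestrian
              (0, 0, 200, 60),     # wide and flat window
              (5, 5, 10, 20)]      # window that is too small
kept = filter_by_shape(candidates)
```

Only the first window survives in this example. A cheap geometric test of this kind runs before the BING filter and the CNN, so every window it rejects saves the cost of the later, more expensive stages.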

Comparison of experimental results and analysis

Development platform

The image acquisition device is a Point Grey CMOS camera with Ethernet communication, 1.3 million pixels (1280 × 1024), and a frame rate of 85 fps. To reduce the amount of computation, the images to be identified are processed at 640 × 480 pixels. TCP/IP is used as the communication protocol between the server and the host PC client: the server thread Server_thread transmits the video to the host PC client over a TCP socket. The image and video processing platform is an ASUS desktop with an Intel Core i7 processor at 2.5 GHz, a GTX1080 graphics card, a 120 GB solid-state drive, and 4 GB of memory. The experimental platform is shown in Figures 7 and 8.

The CNN-based pedestrian detection runs under Windows 10. Pedestrian images are detected using Python 3.5 and VS2015 together with MATLAB 2014a, TensorFlow, and Caffe.

Analyses of experimental results

During training of the improved LeNet-5 network, the training dynamics can be observed intuitively by plotting the accuracy and loss curves. Figure 9 shows the accuracy and loss curves of the improved LeNet-5 network during training. As can be seen from the figure, the network becomes stable after 1500–2000 epochs.

Figure 7.

Smart car.

Figure 8.

Identification platform.

In order to test the applicability of the improved LeNet-5 CNN algorithm proposed in this paper, the VJ, HOG+SVM, HOGLBP, MultiFtr+Motion, LeNet-5 CNN, and improved LeNet-5 CNN detection algorithms were compared. The networks were trained for 1500–2000 epochs. Training and test data were taken from the Caltech pedestrian dataset of the California Institute of Technology.¹⁷ The dataset contains 11 data packages, of which set00–set05 form the training set and set06–set10 form the test set. There are 64,468 data samples in total, including 4396 positive samples and 60,072 negative samples. As shown in Figure 9, the network becomes stable after 1500–2000 epochs.

As shown in Figure 10, at 0.1 false positives per image (FPPI), the improved LeNet CNN achieves a miss rate of 25%, while the classical LeNet CNN achieves 33%; both are much lower than those of traditional statistical learning methods such as VJ, HOG+SVM, and MultiFtr+Motion. The improved LeNet CNN therefore performs better than the classical LeNet CNN and is well suited to training on the Caltech dataset. Figure 10 also shows that SA-Fast R-CNN, the best method for pedestrian detection, achieves a miss rate of 13%. It is also a CNN-based pedestrian detection method; to account for the influence of pedestrian size, it adopts different detection models for different pedestrian heights. Although the method proposed in this paper trails it by about 12 percentage points, the improved LeNet CNN has scale invariance and near real-time processing speed, and its miss rate can basically meet practical needs.
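For readers unfamiliar with the two evaluation quantities, the sketch below shows how miss rate and FPPI are computed from detection counts. The counts are hypothetical and chosen only so that the results match the operating point discussed above; they are not taken from the experiments.

```python
def miss_rate(true_positives, false_negatives):
    """Fraction of ground-truth pedestrians that the detector failed to find."""
    return false_negatives / (true_positives + false_negatives)

def fppi(false_positives, num_images):
    """Average number of false positives per image."""
    return false_positives / num_images

# Hypothetical counts, not taken from the experiments
mr = miss_rate(true_positives=75, false_negatives=25)
rate = fppi(false_positives=100, num_images=1000)

# Gap between the miss rates quoted in the text (25% vs 13%)
gap = round(0.25 - 0.13, 2)
```

A lower miss rate at the same FPPI means that more pedestrians are found for the same number of false alarms, which is why the methods in Figure 10 are compared at a fixed FPPI.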

Figure 9.

Accuracy and loss curves during training.

Figure 10.

Overall results on the Caltech dataset.

On the same experimental platform, the performance of the algorithms is analyzed in terms of running time. The three detection algorithms (ours, SA-Fast R-CNN, and LeNet CNN) are trained on the set05 dataset. Table 2 compares the training time of the three network models. After the three classifiers have been trained, detection is run on the set09 dataset and the mean of all detection times is taken. Table 3 compares the average detection time of the three network models.

Table 1.

Improved LeNet-5 CNN parameters.

Layer	Type	Kernel size	Number of kernels	Feature map size	Neurons	Trainable parameters	Connections
C1	Convolution layer	5 × 5	6	28 × 28	4704	156	122,304
S1	Downsampling	2 × 2	6	14 × 14	1176	12	5880
C2	Convolution layer	5 × 5	12	10 × 10	1600	1516	151,600
S2	Downsampling	2 × 2	12	5 × 5	400	32	2000
F6	Full connection		320	1 × 1	320	10,164	10,164
	Output layer			1 × 43	43		43

Table 2.

Comparison of training time between three algorithm network models.

Algorithmic network model	Ours	SA-Fast R-CNN	LeNet CNN
Training time (hours)	0.701	0.892	1.096

Table 3.

Comparison of detection time between three algorithm network models.

Algorithmic network model	Ours	SA-Fast R-CNN	LeNet CNN
Detection time (seconds/frame)	0.204	0.359	0.598

According to Table 2, the training times of our algorithm, SA-Fast R-CNN, and LeNet CNN are 0.701, 0.892, and 1.096 h, respectively. According to Table 3, the LeNet CNN algorithm takes about 0.598 s to detect each image, the longest of the three algorithms. SA-Fast R-CNN takes about 0.359 s per frame, 0.155 s more than the proposed algorithm. The proposed algorithm is faster because the BN normalization algorithm, the dynamic adaptive pooling model, and the improved ReLU activation function are adopted in the improved LeNet-5 CNN, which speeds up the operation of the network.

This paper also shows actual detection results of the classical LeNet-5 CNN and the improved LeNet-5 CNN. To verify the real-time performance and accuracy of the two models, pedestrians are detected in video. The detection results are shown in Figures 11 and 12.
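The time differences quoted above follow directly from Table 3, as the short calculation below shows. The frames-per-second figures are derived here for illustration and are not reported in the original tables.

```python
# Detection time in seconds per frame, copied from Table 3
detection_time = {"Ours": 0.204, "SA-Fast R-CNN": 0.359, "LeNet CNN": 0.598}

ours = detection_time["Ours"]
# Time saved per frame by the proposed model relative to each baseline
saved = {name: round(t - ours, 3)
         for name, t in detection_time.items() if name != "Ours"}
# Equivalent throughput in frames per second (derived, not reported in the text)
fps = {name: round(1.0 / t, 2) for name, t in detection_time.items()}
```

The proposed model saves 0.155 s per frame against SA-Fast R-CNN and 0.394 s against LeNet CNN, which corresponds to roughly 4.9 frames per second on the platform described above.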

Figure 11.

LeNet-5 CNN structure pedestrian detection.

Figure 12.

Improved LeNet-5 CNN structure pedestrian detection.

As can be seen from Figure 11, when the pedestrian scale is relatively small, the LeNet CNN algorithm can hardly detect pedestrians, and there are many missed and false detections. In particular, when two pedestrians overlap in heavy pedestrian traffic, only one of them is detected. The improved LeNet CNN algorithm in Figure 12 detects occluded and smaller-scale pedestrians more reliably, although misjudgments still occur occasionally. Overall, the improved LeNet-5 CNN algorithm achieves a better detection rate than the classical LeNet-5 CNN algorithm under real-time conditions and complex backgrounds.

Conclusion

This paper improves the classical LeNet-5 CNN in three aspects: first, a BN layer is added after each convolution layer; second, a dynamic adaptive pooling model is adopted; third, an improved ReLU activation function is used to alleviate the gradient vanishing problem. The Caltech pedestrian database was used to evaluate the nine-layer LeNet CNN model. Experiments show that the detection time of the improved LeNet CNN is 0.155 s shorter than that of SA-Fast R-CNN and 0.394 s shorter than that of the classical LeNet CNN. Under occlusion, weak illumination, and complex environments, the miss rate of the proposed model is 25%, and the model offers good real-time performance and accuracy. However, the improved LeNet-5 network still produces misjudgments, so future studies will focus on this aspect.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Natural Science Foundation of Shaanxi Province of China (2016GY-007) and the key research and development program of Shaanxi Province (2016KTZDGY4-05–1). It was also supported by the Xi'an University of Science and Technology.

References

1. Rodolfo TE and Miguel TT. Robust lane sensing and departure warning under shadows and occlusions. Sensors 2013; 13: 3270–3298.

2. Cai, Hai, Chen, et al. Pedestrian detection algorithm for driver assistance system based on fused saliency. Autom Eng 2015; 37: 1215–1220.

3. Chen, Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. Eprint Arxiv 2014; 1736–1744.

4. Tompson, Jain, Lecun, et al. Joint training of a convolutional network and a graphical model for human pose estimation. Eprint Arxiv 2014; 1799–1807.

5. Wang, Ouyang, Wang, et al. Visual tracking with fully convolutional networks. In: IEEE international conference on computer vision, 2015, pp. 3119–3127. Washington DC: IEEE Computer Society.

6. Nam, Han. Learning multi-domain convolutional neural networks for visual tracking. In: The IEEE conference on computer vision and pattern recognition, June 2016, pp. 4293–4302.

7. Wang, et al. Comprehensive evaluation of implementation effect on later-period supportive policy of reservoir resettlement based on ANFIS. Adv Mater Res 2014; 912–914: 1874–1878.

8. Agrawal, Girshick, Malik. Analyzing the performance of multilayer neural networks for object recognition. In: Computer vision-ECCV 2014. New York: Springer International Publishing, 2014, pp. 329–344.

9. Wang ZF, Su HT, Chen HS, et al. A model of target detection in variegated natural scene based on visual attention. Appl Mech Mater 2013; 333–335: 1213–1218.

10. Monti, Baroffio, Bondi, et al. Deep convolutional neural networks for pedestrian detection. Image Commun 2016; 47: 482–489.

11. Zhang, Ren, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 2015; 37: 1904–1916.

12. Zhang, Liu, et al. Robust visual tracking via convolutional networks without training. IEEE Trans Image Process 2016; 25: 1779–1792.

13. Peng, et al. Vision-based object detection and tracking: a review. Acta Autom Sin 2016; 42: 1466–1489.

14. Gkioxari, Dollar, et al. In: IEEE international conference on computer vision (ICCV). Washington DC: IEEE Computer Society, 2017, pp. 1388–1397.

15. Zhang, Donahue, Girshick, et al. Part-based R-CNNs for fine-grained category detection. In: European conference on computer vision. Cham: Springer, 2014, pp. 834–849.

16. Liu, Ze-Min, Lei, et al. Pedestrian detection based on objectness and space-time covariance features. Comput Sci 2018; 45(S1): 210–214, 246.

17. Zheng, HHS. Image classification and annotation based on robust regularized coding. Signal Image Video Process 2016; 10: 1–10.