Improve YOLOv3 using dilated spatial pyramid module for multi-scale object detection

Abstract

Recognizing multi-scale objects effectively and efficiently is one of the key challenges in applying deep convolutional neural networks to object detection. YOLOv3 (You only look once v3) is a state-of-the-art object detector that performs well in both accuracy and speed; however, scale variation remains a challenging problem. Considering that the detection performance on multi-scale objects is related to the receptive fields of the network, in this work, we propose a novel dilated spatial pyramid module that integrates multi-scale information to deal with the scale variation problem effectively. First, the input of the dilated spatial pyramid is fed into multiple parallel branches with different dilation rates to generate feature maps with different receptive fields. Then, the input of the dilated spatial pyramid and the outputs of the different branches are concatenated to integrate multi-scale information. Moreover, the dilated spatial pyramid is integrated with YOLOv3 in front of the first detection header to form the dilated spatial pyramid-You only look once model. Experimental results on PASCAL VOC2007 demonstrate that the dilated spatial pyramid-You only look once model outperforms other state-of-the-art methods in mean average precision while keeping a satisfactory real-time detection speed. For 416 × 416 input, the model achieves 82.2% mean average precision at 56 frames per second, 3.9% higher than YOLOv3 with only a slight drop in speed.

Keywords

Real-time object detection, YOLOv3, scale variation, dilated spatial pyramid, receptive fields

Introduction

Real-time multi-scale object detection is one of the most challenging tasks in computer vision. Generally, traditional object detection algorithms are composed of three stages: selecting candidate regions on the given image, extracting features from these regions, and finally classifying each region with a trained classifier. The performance of such algorithms generally depends on the expressive power of the hand-designed features.

In recent years, with the development of big data technology and the improvement of computing performance, deep convolutional neural networks (DCNNs) have achieved significant advances in object detection. Existing object detection algorithms based on DCNN can be roughly divided into two categories: (1) two-stage methods, mainly including R-CNN,¹ Fast R-CNN,² Faster R-CNN,³ and R-FCN⁴ and (2) one-stage methods, mainly including You only look once (YOLO)⁵ and Single Shot MultiBox Detector (SSD).⁶

The two-stage methods first generate a series of region proposals and then extract features with a CNN for classification and bounding box regression. Although the two-stage methods have achieved competitive performance, they are too slow for real-time applications due to their intensive computation cost.

The one-stage methods treat object detection as a single regression problem and are able to realize real-time detection due to their high computational efficiency, although their accuracy is usually lower than that of the two-stage methods. YOLO⁵ and SSD⁶ are two representative one-stage methods. The YOLO algorithm has gone through three stages of development: (1) YOLO⁵ divides the input image into s × s grid cells, but each grid cell can only predict one class of object and, therefore, YOLO has difficulty with dense and small object detection; (2) YOLOv2⁷ improves the base network of YOLO and adopts the anchor mechanism and a multi-scale training method; consequently, both accuracy and speed are improved; and (3) YOLOv3⁸ proposes a new network, Darknet-53, and predicts objects at three different scales. In terms of performance, for 320 × 320 input, YOLOv3 achieves 28.2% mean average precision (mAP) in 22 ms on the MS COCO data set, being as accurate as SSD but three times faster.

Although YOLOv3 performs very well on small object detection thanks to its multi-scale predictions, it performs relatively worse on medium and larger objects.⁸ Much has been done to improve the accuracy of the YOLO algorithm. In 2018, Li and Yang⁹ replaced the standard convolution with depth-wise separable convolution and introduced the feature pyramid network (FPN)¹⁰ into the detection module of YOLOv2 to reduce parameters and improve feature extraction ability; the improved YOLOv2 algorithm outperforms YOLOv2 on small object detection. In 2019, Zhang et al.¹¹ added three residual blocks to the bottom of the residual network of the original YOLOv3 and designed six multi-scale convolutional feature maps for prediction to form the DF-YOLOv3 model, thus improving vehicle detection accuracy in complex scenes. To make full use of multi-scale feature maps, Huang and Wang¹² proposed the DC-spatial pyramid pooling (SPP)-YOLO model, which applied dense connections and an improved SPP¹³ module to YOLOv2. Xu et al.¹⁴ proposed Attention-YOLO, in which channel and spatial attention mechanisms are added to the feature extraction network of YOLOv3; as a result, Attention-YOLO achieves an AP₅₀ about 0.6% higher than YOLOv3 on the MS COCO data set.

Although the improved YOLO algorithms above raise accuracy, their inference speed drops noticeably at the same time, and recognizing objects at multiple scales remains a challenging problem for YOLOv3.

There are a variety of ways to remedy the scale variation problem for object detectors. The direct but inefficient method is to extract features and make predictions on multiple scales of the image, as in the image pyramid,¹⁵ which is widely used in hand-engineered feature-based methods.¹⁶,¹⁷ Instead of taking multiple scales of images as input, another family of methods deals with the scale variation problem by exploiting multiple layers of the CNN. SSD⁶ used a pyramidal feature hierarchy to perform object detection at different layers and combined the predictions from multiple feature maps with different resolutions to deal with objects of various sizes. FPN¹⁰ utilized a top-down architecture with lateral connections to build high-level semantic feature maps at all scales, which has shown significant improvements over several baselines. Feature Fusion Single Shot Multibox Detector (FSSD)¹⁸ concatenated features from different layers with different scales and then generated a new pyramidal feature hierarchy by down-sampling the concatenated layer several times; as a result, FSSD improved accuracy significantly over SSD with only a slight decrease in speed. PANet¹⁹ added bottom-up path augmentation to the FPN, which shortened the information path between low-level and high-level feature maps to enhance the feature fusion structure.

Instead of fusing features between layers with different resolutions, other methods resample the feature maps at the same scale to obtain multi-scale information. He et al.¹³ proposed a feature extraction module named SPP, which resampled the feature maps using different pooling rates; test results showed that objects at different scales can be accurately classified by resampling the same feature maps. Liu et al.²⁰ proposed a receptive field block (RFB) module consisting of multi-branch convolution layers with different dilation rates and kernels; simply replacing the top convolution layers of SSD with RFB significantly improved performance. Mehta et al.²¹ proposed an efficient spatial pyramid consisting of a spatial pyramid of dilated convolutions and point-wise convolutions for semantic segmentation.

YOLOv3 exploits multiple layers for detection, which improves small object detection, but it still performs relatively worse on medium and larger objects. Inspired by methods that resample the feature maps at the same scale, a novel dilated spatial pyramid (DSP) module is proposed in this article to integrate multi-scale information by resampling the feature maps and thereby deal with the scale variation problem.

The main contributions of this article are as follows:

A novel DSP module consisting of multiple parallel branches with different dilation rates is proposed to learn multi-scale information from large effective fields to deal with scale variation problem.

DSP-YOLO is presented by integrating DSP module to YOLOv3 in front of the first detection header and it gains significant improvement in accuracy, while it is still able to maintain comparable speed with YOLOv3.

Experiments are designed to explore the optimal number of branches and fusion method of DSP and evaluate the performance among DSP-YOLO and other state-of-the-art object detectors.

The experiment results demonstrate that our DSP-YOLO model outperforms other state-of-the-art object detection methods in accuracy on PASCAL VOC2007, while it is still able to keep a satisfying real-time detection speed.

Proposed methods

The structure of the original YOLOv3 is shown in Figure 1. By extracting features at three different scales and combining low-resolution and high-resolution features via a top-down pathway and lateral connections, YOLOv3 improves small object detection, but it still performs relatively worse on medium and large objects.⁸

Figure 1.

The structure of YOLOv3. YOLOv3 predicts objects at three different scales and finally combines the results to get the final detection. YOLO: You only look once.
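As a rough illustration of the three prediction scales, the sketch below computes the grid size at each detection header for the input resolutions used later in the experiments. The strides of 32, 16, and 8 and the three anchors per grid cell are taken from the original YOLOv3 design rather than from this article, and the function names are hypothetical.

```python
# Strides of the three YOLOv3 detection headers (assumption: taken from the
# original YOLOv3 design; they are not stated in this article).
STRIDES = (32, 16, 8)
ANCHORS_PER_CELL = 3  # assumption: also from the original YOLOv3 design


def grid_sizes(input_size):
    """Return the grid width (= height) at each detection scale."""
    if any(input_size % s for s in STRIDES):
        raise ValueError("input size must be a multiple of 32")
    return [input_size // s for s in STRIDES]


def total_predictions(input_size):
    """Count the boxes predicted over all three scales."""
    return sum(g * g * ANCHORS_PER_CELL for g in grid_sizes(input_size))


for size in (320, 416, 544, 608):
    print(size, grid_sizes(size), total_predictions(size))
```

The coarsest grid (stride 32) belongs to the first detection header, which is where the DSP module is later inserted.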

Theoretically, the detection performance on objects with different scales is influenced by the effective receptive field of the network. Large effective receptive field could be used to enhance the performance of larger scale object detection, but it compromises the performance on small objects at the same time.²² Thus, exploiting multi-scale receptive fields is one of the ways to improve the ability of multi-scale object detection.

Large effective receptive fields can be achieved by down-sampling, adding convolutional layers, applying dilated convolution (atrous convolution),²³ and enlarging the convolutional kernel. If the down-sampling rate is s, the receptive field is increased s times, while the resolution of the feature map is reduced s times, which may result in information loss. Adding convolutional layers or enlarging the kernel size increases the receptive field at the cost of more parameters, and a deeper network may suffer from vanishing gradients during training. Dilated convolution expands the convolutional kernel by inserting holes between adjacent kernel elements, enlarging the effective receptive field without additional parameters or computation cost. As shown in Figure 2, a dilation rate of m means inserting m − 1 holes between adjacent kernel elements, and the dilated convolution is equal to the standard convolution when the dilation rate is set to 1.

Figure 2.

Dilated convolution with 3 × 3 kernel size and different dilation rates: (a) the dilation rate is 1, (b) the dilation rate is 2, and (c) the dilation rate is 3.
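The hole-insertion view of dilation can be made concrete with a small sketch. The helper below is hypothetical and materializes the holes as explicit zeros purely for illustration; practical implementations usually sample the input at spaced positions instead of building the enlarged kernel.

```python
def dilate_kernel(kernel, rate):
    """Insert rate - 1 zeros ("holes") between adjacent kernel weights."""
    if rate < 1:
        raise ValueError("dilation rate must be >= 1")
    k = len(kernel)
    size = k + (k - 1) * (rate - 1)  # side length after inserting holes
    dilated = [[0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            dilated[i * rate][j * rate] = kernel[i][j]
    return dilated


# Illustrative 3 x 3 kernel with distinct weights so the holes are visible.
kernel = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]

for rate in (1, 2, 3):
    print(f"dilation rate {rate}:")
    for row in dilate_kernel(kernel, rate):
        print(" ".join(str(v) for v in row))
```

The number of non-zero weights stays at nine for every rate; only their spacing changes.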

Suppose the kernel size of the dilated convolution is $k$ and the dilation rate is $r$; then the actual kernel size $k'$ is $k' = k + (k - 1)(r - 1) = kr - r + 1$

With a stride of 1, a k × k dilated convolution obtains the same receptive field as a standard convolution with a kernel size of $kr - r + 1$.
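A short sketch of this formula for a 3 × 3 kernel follows. The function name is hypothetical, the dilation rates 1 to 4 are chosen for illustration, and the weight counts are per input-output channel pair.

```python
def effective_kernel_size(k, r):
    """Effective side length k' = k + (k - 1)(r - 1) of a dilated kernel."""
    return k + (k - 1) * (r - 1)


k = 3
for r in (1, 2, 3, 4):
    k_eff = effective_kernel_size(k, r)
    dilated_weights = k * k          # weights actually learned
    dense_weights = k_eff * k_eff    # weights of an equally large standard kernel
    print(f"r={r}: k'={k_eff}, dilated weights={dilated_weights}, "
          f"dense weights={dense_weights}")
```

The comparison shows why dilation is attractive here: the receptive field grows with r while the number of learned weights stays fixed at k × k.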

Dilated convolutional layers with different dilation rates generate different receptive fields, so the choice of dilation rate depends on the scale of the objects to be detected. To deal with the scale variation problem, the receptive fields are expected to balance between small- and large-scale objects. Inspired by SPP, a novel DSP module consisting of multiple parallel branches with different dilation rates is proposed in this article. As shown in Figure 3, in the DSP structure, the multiple parallel branches simultaneously resample the convolutional feature maps using a 3 × 3 convolution with different dilation rates to generate feature maps with different receptive fields. The kth branch uses dilation rate k and the padding parameter $p = 3 (k - 1) / 2$ to keep the resolution of the input feature map unchanged after dilated convolution while increasing the receptive field. We propose the following two methods to fuse feature maps with different receptive fields and obtain multi-scale information:

Figure 3.

The structure of DSP module: (a) the input of DSP and outputs of different branches are concatenated to integrate multi-scale information; (b) using element-wise addition between the input of DSP and output of 1 × 1 standard convolutional layer to improve the gradient flow inside the network. DSP: dilated spatial pyramid.

The first method is concatenation. As shown in Figure 3(a), the input of the DSP a module and the outputs of the different dilated convolutional layers are concatenated along the channel dimension of the feature maps to obtain richer semantic information.

The second method is element-wise addition. As shown in Figure 3(b), the outputs of the different branches are concatenated along the channel dimension of the feature maps. The following 1 × 1 standard convolutional layer reduces the number of channels, since element-wise addition is performed on two feature maps channel by channel and therefore requires the input and output to have the same dimensions. To improve the gradient flow inside the network, the element-wise addition is performed between the input of the DSP b module and the output of the 1 × 1 standard convolutional layer.
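The shape bookkeeping behind the two fusion methods can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the feature-map size, the channel widths, and the function names are hypothetical, and the padding is set equal to the dilation rate, which preserves resolution for a 3 × 3, stride-1 convolution under the standard output-size formula (the padding parameter of a given framework may be defined differently).

```python
def conv_output_size(size, kernel=3, stride=1, padding=0, dilation=1):
    """Standard output-size formula for a (dilated) convolution."""
    effective = kernel + (kernel - 1) * (dilation - 1)
    return (size + 2 * padding - effective) // stride + 1


def dsp_a_channels(in_channels, branch_channels, branches):
    """DSP a: concatenate the input with every branch output."""
    return in_channels + branches * branch_channels


def dsp_b_channels(in_channels, branch_channels, branches):
    """DSP b: concatenate the branches, then reduce with a 1x1 convolution."""
    concatenated = branches * branch_channels
    reduced = in_channels  # the 1x1 convolution must restore the input width
    return concatenated, reduced


feature_size = 13        # hypothetical: coarsest grid for a 416 x 416 input
in_channels = 512        # hypothetical channel width
branch_channels = 512    # hypothetical: each branch keeps the input width
branches = 4

for rate in range(1, branches + 1):
    # Assumption: padding equal to the dilation rate (see lead-in).
    out = conv_output_size(feature_size, padding=rate, dilation=rate)
    print(f"branch {rate}: dilation={rate}, output size={out}")

print("DSP a output channels:",
      dsp_a_channels(in_channels, branch_channels, branches))
print("DSP b channels (after concat, after 1x1):",
      dsp_b_channels(in_channels, branch_channels, branches))
```

Because every branch keeps the spatial size, concatenation only widens the channel dimension, whereas the addition variant needs the 1 × 1 layer to bring the width back to that of the input before the two maps can be summed.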

As shown in Figure 4(a), the DSP a module is integrated with YOLOv3 after the third convolutional layer in the first convolutional set (the gray block in Figure 1), and a 1 × 1 standard convolutional layer is added after the DSP a module to reduce dimensions, leaving the 3 × 3 convolutional layer with smaller input dimensions to reduce computation.

Figure 4.

The structure of convolutional set before the first detection header of DSP-YOLO. (a) DSP a is integrated after the third convolutional layer in the first convolutional set and 1 × 1 convolutional layer is added to reduce dimensions; (b) DSP b is integrated between the third and fourth convolutional layers in the first convolutional set. DSP: dilated spatial pyramid; YOLO: You only look once.

The DSP b module is integrated with YOLOv3 between the third and fourth convolutional layers in the first convolutional set (the gray block in Figure 1) to present DSP-YOLO b model. The corresponding structure is shown in Figure 4(b).

Experiments

The performance of the DSP-YOLO is compared with state-of-the-art object detection methods on PASCAL VOC 2007 data set,²⁴ which has 20 categories, and mAP and frames per second (FPS) are used to evaluate the performance of our approach.

Regarding the experiment on VOC2007 test, the union of VOC2007 trainval and VOC2012 trainval is used as the training data and the VOC2007 test (4952 images) as the test data.

Experiment condition

The proposed model is trained on GPU, and the detailed experimental configurations are listed in Table 1.

Table 1.

Experimental configurations.

Hardware and software	Profile
CPU	Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz
GPU	NVIDIA Titan Xp
Deep learning framework	Darknet
Operating system	Ubuntu 16.04

Training strategy

DSP-YOLO is based on Darknet53, which is pretrained on ImageNet,²⁵ and it is trained with a batch size of 64 on a Titan Xp GPU with an initial learning rate of 10⁻³, which is divided by 10 at 40,000 and 45,000 iterations. The maximum number of training iterations is 60,000, the weight decay is 0.0005, and the momentum is 0.9. The training strategies mostly follow YOLOv3, including multi-scale training, data augmentation, convolutional with anchor boxes, and the loss function. For a fair comparison, YOLOv3 and YOLOv3-spp are trained in the same way as DSP-YOLO.
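The step schedule described above can be written out as a small function. This is a sketch only: the names are hypothetical, any warm-up phase the framework may apply is not modeled, and treating the drop as taking effect exactly at the stated iteration is an assumption.

```python
BASE_LR = 1e-3
DECAY_STEPS = (40_000, 45_000)   # iterations at which the rate is divided by 10
MAX_ITERATIONS = 60_000


def learning_rate(iteration):
    """Step schedule: divide the base rate by 10 at each decay step reached."""
    if not 0 <= iteration <= MAX_ITERATIONS:
        raise ValueError("iteration outside the training range")
    # Assumption: the drop applies from the listed iteration onward.
    drops = sum(1 for step in DECAY_STEPS if iteration >= step)
    return BASE_LR / (10 ** drops)


for it in (0, 39_999, 40_000, 45_000, 60_000):
    print(it, learning_rate(it))
```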

Ablation study on VOC2007

The number of branches in DSP module

To find the optimal number of branches in DSP, DSP-YOLO a with different numbers of branches is evaluated on PASCAL VOC2007, and the performance with 320 × 320 input is recorded in Table 2. According to the results in Table 2, with three branches, DSP-YOLO a achieves 79.8% mAP, 4.2% higher than the original YOLOv3, demonstrating that the DSP structure indeed improves the performance of YOLOv3, though with a slight drop in speed. With five branches, DSP-YOLO a achieves 79.7% mAP, 0.3% lower than with four branches, indicating that five branches bring no further improvement over four. Considering both accuracy and speed, the number of branches in the DSP module is set to 4.

Table 2.

PASCAL VOC2007 test detection results with different number of branches.

Method	Fusion method	Branches	mAP (%)	FPS
YOLOv3	Concatenation	0	75.6	75.8
DSP-YOLO a	Concatenation	3	79.8	72.3
DSP-YOLO a	Concatenation	4	80.0	72
DSP-YOLO a	Concatenation	5	79.7	71.9

mAP: mean average precision; FPS: frames per second; DSP: dilated spatial pyramid; YOLO: You only look once.
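The trade-off in Table 2 can be summarized with a few lines of arithmetic. The values are copied from the table; the variable names and the choice to express the speed cost as a relative FPS loss are illustrative assumptions.

```python
# (branches, mAP %, FPS) copied from Table 2; 0 branches is the YOLOv3 baseline.
table2 = [(0, 75.6, 75.8), (3, 79.8, 72.3), (4, 80.0, 72.0), (5, 79.7, 71.9)]

_, base_map, base_fps = table2[0]
summary = {}
for branches, m_ap, fps in table2[1:]:
    gain = round(m_ap - base_map, 1)                        # mAP gain over baseline
    slowdown = round(100 * (base_fps - fps) / base_fps, 1)  # relative FPS loss, %
    summary[branches] = (gain, slowdown)
    print(f"{branches} branches: +{gain} mAP, {slowdown}% fewer FPS")

best = max(table2[1:], key=lambda row: row[1])[0]
print("branches with the highest mAP:", best)
```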

Fusion method of feature maps in DSP module

To evaluate the different fusion methods for feature maps in the DSP module, DSP-YOLO a and DSP-YOLO b are evaluated on PASCAL VOC2007, and the performance with 320 × 320 input is recorded in Table 3. According to the results in Table 3, DSP-YOLO a achieves 80.0% mAP, slightly higher than the 79.9% of DSP-YOLO b at the same speed, suggesting that concatenation is marginally more effective for feature map fusion in the DSP module. Considering both accuracy and speed, DSP-YOLO a with 4 branches is selected and called DSP-YOLO.

Table 3.

PASCAL VOC2007 test detection results with different fusion methods in DSP module.

Method	Branches	Fusion method	mAP (%)	FPS
DSP-YOLO a	4	Concatenation	80.0	72
DSP-YOLO b	4	Element-wise addition	79.9	72

mAP: mean average precision; FPS: frames per second; DSP: dilated spatial pyramid; YOLO: You only look once.

Experiments on PASCAL VOC 2007

Table 4 shows the PASCAL VOC2007 test detection results. Since Redmon and Farhadi did not report results for the original YOLOv3 and YOLOv3-spp on PASCAL VOC 2007 test in their article,⁸ YOLOv3 and YOLOv3-spp are trained with the same training strategy as DSP-YOLO to obtain the benchmark results.

Table 4.

PASCAL VOC2007 test detection results.

Method	mAP	Aero	Bike	Bird	Boat	Bottle	Bus	Car	Cat	Chair	Cow	Table	Dog	Horse	Mbike	Person	Plant	Sheep	Sofa	Train	TV
Fast²	70.0	77.0	78.1	69.3	59.4	38.3	81.6	78.6	86.7	42.8	78.8	68.9	84.7	82.0	76.6	69.9	31.8	70.1	74.8	80.4	70.4
Faster³	73.2	76.5	79.0	70.9	65.5	52.1	83.1	84.7	86.4	52.0	81.9	65.7	84.8	84.6	77.5	76.7	38.8	73.6	73.9	83.0	72.6
R-FCN⁴	80.5	79.9	87.2	81.5	72.0	69.8	86.8	88.5	89.8	67.0	88.1	74.5	89.8	90.6	79.9	81.2	53.7	81.8	81.5	85.9	79.9
SSD300⁶	74.3	75.5	80.2	72.3	66.3	47.6	83.0	84.2	86.1	54.7	78.3	73.9	84.5	85.3	82.6	76.2	48.6	73.9	76.0	83.4	74.0
SSD512⁶	76.8	82.4	84.7	78.4	73.8	53.2	86.2	87.5	86.0	57.8	83.1	70.2	84.9	85.2	83.9	79.7	50.3	77.9	73.9	82.5	75.3
DSSD513²⁶	81.5	86.6	86.2	82.6	74.9	62.5	89.0	88.7	88.8	65.2	87.0	78.7	88.2	89.0	87.5	83.7	51.1	86.3	81.6	85.7	83.7
STDN513²⁷	80.9	86.1	89.3	79.5	74.3	61.9	88.5	88.3	89.4	67.4	86.5	79.5	86.4	89.2	88.5	79.3	53.0	77.9	81.4	86.6	85.5
YOLOv2 416⁷	76.8	87.9	87.5	78.2	61.5	57.9	84.9	82.9	90.6	54.9	83.6	66.5	90.1	85.2	85.8	82.9	54.2	78.9	65.2	87.3	69.8
Improved YOLOv2⁹	77.3	88.2	87.3	79.3	61.3	61.0	85.2	81.3	90.0	58.8	83.4	68.3	89.3	84.6	85.5	82.7	54.1	80.2	66.3	87.6	71.8
DC-SPP-YOLO 416¹²	78.4	80.0	84.9	76.0	68.0	53.8	87.6	83.9	90.1	62.5	84.1	75.8	88.6	87.3	85.7	77.0	54.3	81.7	80.1	88.3	78.7
DC-SPP-YOLO 544¹²	79.6	83.1	85.9	77.2	69.5	59.7	88.5	86.3	89.9	62.6	86.0	78.3	87.6	88.0	86.7	80.1	54.3	81.3	80.4	87.6	79.4
YOLOv3 416⁸	78.3	88.7	84.3	76.1	67.6	62.8	85.7	88.8	88.9	60.4	83.6	71.6	86.0	87.9	86.4	81.7	49.1	81.1	76.6	84.9	74.7
YOLOv3-spp 416⁸	78.4	87.5	84.3	78.5	66.1	66.0	86.4	88.7	89.5	57.6	81.5	71.3	86.1	87.3	85.2	82.6	51.7	79.4	75.8	83.2	78.7
DSP-YOLO 320	80.0	86.4	87.5	77.3	70.3	61.8	85.3	88.3	89.0	64.4	83.8	76.2	85.3	87.9	88.2	83.3	53.2	81.5	80.6	87.8	81.1
DSP-YOLO 416	82.2	88.5	89.5	79.1	74.0	68.7	89.7	90.6	89.9	66.7	84.4	75.0	89.2	89.3	89.8	85.8	56.6	84.4	81.1	89.1	81.6
DSP-YOLO 544	82.8	91.0	89.9	82.0	75.2	70.5	91.5	92.3	91.6	67.1	87.0	76.1	89.2	91.6	88.3	87.0	55.9	86.7	79.8	85.5	78.8
DSP-YOLO 608	83.1	91.0	90.7	81.8	75.6	73.8	91.3	92.7	91.2	66.9	86.9	75.5	89.0	90.4	88.6	87.3	55.1	87.3	80.0	86.9	80.0

mAP: mean average precision; DSP: dilated spatial pyramid; YOLO: You only look once; SPP: spatial pyramid pooling; SSD: Single Shot MultiBox Detector.

The boldface values represent the maximum value in each column.

With low-resolution input images (e.g. 320 × 320), DSP-YOLO achieves 80.0% mAP, 1.5% lower than DSSD513.²⁶ However, with high-resolution input (e.g. 608 × 608), DSP-YOLO reaches 83.1% mAP, better than other improved YOLO models such as DC-SPP-YOLO¹² and other state-of-the-art methods such as R-FCN,⁴ DSSD513,²⁶ and STDN513.²⁷

For 416 × 416 input, compared to original YOLOv3, DSP-YOLO improves detection performance in almost every category, improving the ability for multi-scale detection, which proves the effectiveness of the proposed DSP module.
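The per-category comparison can be checked directly from the 416 × 416 rows of Table 4. The sketch below copies those two rows, recomputes mAP as the unweighted mean of the 20 class APs, and lists the categories in which DSP-YOLO does not improve on YOLOv3; the variable and function names are hypothetical.

```python
CLASSES = ["Aero", "Bike", "Bird", "Boat", "Bottle", "Bus", "Car", "Cat",
           "Chair", "Cow", "Table", "Dog", "Horse", "Mbike", "Person",
           "Plant", "Sheep", "Sofa", "Train", "TV"]

# Per-class AP (%) copied from the 416 x 416 rows of Table 4.
yolov3 = [88.7, 84.3, 76.1, 67.6, 62.8, 85.7, 88.8, 88.9, 60.4, 83.6,
          71.6, 86.0, 87.9, 86.4, 81.7, 49.1, 81.1, 76.6, 84.9, 74.7]
dsp_yolo = [88.5, 89.5, 79.1, 74.0, 68.7, 89.7, 90.6, 89.9, 66.7, 84.4,
            75.0, 89.2, 89.3, 89.8, 85.8, 56.6, 84.4, 81.1, 89.1, 81.6]


def mean_ap(per_class_ap):
    """mAP is the unweighted mean of the per-class average precisions."""
    return sum(per_class_ap) / len(per_class_ap)


rows = list(zip(CLASSES, yolov3, dsp_yolo))
not_improved = [name for name, base, ours in rows if ours <= base]
largest_gain = max(rows, key=lambda t: t[2] - t[1])[0]

print(f"YOLOv3 mAP:   {mean_ap(yolov3):.2f}")
print(f"DSP-YOLO mAP: {mean_ap(dsp_yolo):.2f}")
print("classes without improvement:", not_improved)
print("largest gain:", largest_gain)
```

On these rows, the only category without a gain is "Aero", and the recomputed means agree with the reported mAP values to within rounding.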

Inference time

Table 5 compares the proposed DSP-YOLO with state-of-the-art object detection methods in mAP and FPS on PASCAL VOC2007 test. With an NVIDIA Titan Xp and CUDA 9.1, DSP-YOLO achieves 82.2% mAP at 56 FPS with 416 × 416 input, being as accurate as RFB Net512 but much faster. For high-resolution input (e.g. 608 × 608), DSP-YOLO achieves 83.1% mAP at 31 FPS, surpassing the other state-of-the-art object detection models listed, including two-stage and one-stage methods and other improved YOLO algorithms, while still maintaining real-time detection speed.

Table 5.

Comparison of object detection methods on PASCAL VOC2007 test.

Method	Base network	mAP(%)	FPS	GPU
Faster R-CNN³	VGG16	73.2	7	Titan X
R-FCN⁴	ResNet-101	80.5	9	Titan X
RefineDet320²⁸	VGG16	80.0	40.3	Titan X
RefineDet512²⁸	VGG16	81.8	24.1	Titan X
RFB Net300²⁰	VGG16	80.5	83	Titan X
RFB Net512²⁰	VGG16	82.2	38	Titan X
STDN513²⁷	DenseNet-169	80.9	28.6	Titan Xp
SSD300⁶	VGG16	74.3	46	Titan X
SSD512⁶	VGG16	76.8	19	Titan X
DSSD321²⁶	ResNet-101	78.6	9.5	Titan X
DSSD513²⁶	ResNet-101	81.5	5.5	Titan X
YOLOv2 416⁷	Darknet19	76.8	67	Titan X
YOLOv3 416⁸	Darknet53	78.3	57.7	Titan Xp
YOLOv3-spp 416⁸	Darknet53	78.4	56.7	Titan Xp
DC-SPP-YOLO 416¹²	Darknet19	78.4	56.3	GTX 1080 Ti
DC-SPP-YOLO 544¹²	Darknet19	79.6	38.9	GTX 1080 Ti
Attention-YOLO-B¹⁴	Darknet53	81.9	26	Titan X
DSP-YOLO 416	Darknet53	82.2	56	Titan Xp
DSP-YOLO 544	Darknet53	82.8	36	Titan Xp
DSP-YOLO 608	Darknet53	83.1	31	Titan Xp

mAP: mean average precision; FPS: frames per second; DSP: dilated spatial pyramid; YOLO: You only look once; SPP: spatial pyramid pooling; SSD: Single Shot MultiBox Detector.
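To relate the FPS figures to per-frame latency, the sketch below converts the Darknet53 rows of Table 5 that were measured on the Titan Xp. The 30 FPS real-time threshold is a common rule of thumb assumed here, not a definition given in the article, and the names are hypothetical; rows measured on other GPUs are left out because their speeds are not directly comparable.

```python
# (method, mAP %, FPS) for the Darknet53 models measured on a Titan Xp (Table 5).
rows = [
    ("YOLOv3 416", 78.3, 57.7),
    ("YOLOv3-spp 416", 78.4, 56.7),
    ("DSP-YOLO 416", 82.2, 56.0),
    ("DSP-YOLO 544", 82.8, 36.0),
    ("DSP-YOLO 608", 83.1, 31.0),
]

REAL_TIME_FPS = 30  # assumption: rule-of-thumb threshold, not from the article


def latency_ms(fps):
    """Average time per frame in milliseconds."""
    return 1000.0 / fps


for name, m_ap, fps in rows:
    flag = "real-time" if fps >= REAL_TIME_FPS else "below threshold"
    print(f"{name:15s} {m_ap:5.1f}% mAP  {latency_ms(fps):5.1f} ms/frame  {flag}")
```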

Visualization

In Figure 5, we show several detection examples of YOLOv3 and DSP-YOLO on PASCAL VOC2007 test. Compared to the original YOLOv3, the proposed DSP-YOLO model improves in two scenarios. The first scenario is the detection of multi-scale objects. YOLOv3 performs relatively poorly on medium and larger objects and suffers from false detection (the false detections "train" and "aeroplane" in Figure 5(a)), repeated detection (Figure 5(c)), and missed detection (Figure 5(c), (e), and (g)), whereas DSP-YOLO shows significant improvement, as shown in Figure 5(b), (d), (f), and (h). Comparing Figure 5(i) with Figure 5(j), DSP-YOLO also works well on small object detection. The second scenario is the detection of occluded objects. As shown in Figure 5(a) to (d) and (k) to (n), heavily occluded objects are well detected by DSP-YOLO.

Figure 5.

Comparison of detection results of YOLOv3 and DSP-YOLO. The first column shows the results of YOLOv3 and the second column is the result of DSP-YOLO. DSP: dilated spatial pyramid; YOLO: You only look once.

Conclusions

To improve the multi-scale object detection performance of YOLOv3, this article proposes a novel DSP module consisting of multiple parallel branches with different dilation rates and integrates it with YOLOv3 in front of the first detection header to form the DSP-YOLO model. Experiments show that DSP-YOLO outperforms YOLOv3 on PASCAL VOC2007 not only on medium and larger objects but also on occluded and small objects, while maintaining a speed comparable to that of YOLOv3. Moreover, for 608 × 608 input, DSP-YOLO achieves 83.1% mAP, surpassing other state-of-the-art object detectors on PASCAL VOC2007 while keeping real-time speed. In addition, although the DSP module is only applied to YOLOv3 in this article, it can act as a generic solution for scale-invariant representations and can be integrated with other object detectors such as SSD.

In future work, DSP-YOLO will be applied to the autonomous driving field to detect cars and people in both near and far scenes.

Footnotes

Acknowledgements

The authors would like to thank the support from Jiangsu Overseas Visiting Scholar Program for University Prominent Young & Middle-aged Teachers and Presidents.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the projects of the National Key Research and Development Plan of China [Grant Number: 2016YFB0502103] and the Natural Science Foundation of Jiangsu Province of China [Grant Number: BK20160696].

ORCID iD

Xiaoguo Zhang

References

1. Girshick, Donahue, Darrell, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE conference on computer vision and pattern recognition, CVPR 2014 (eds Schmid, Soatto, Tomasi), Columbus, OH, USA, 23–28 June 2014, pp. 580–587. Piscataway, NJ: IEEE.

2. Girshick. Fast R-CNN. In: IEEE international conference on computer vision (eds Grauman, Grauman, Zisserman), Santiago, Chile, 7–13 December 2015, pp. 1440–1448. Piscataway, NJ: IEEE.

3. Ren, Girshick, et al. Faster R-CNN: towards real-time object detection with region proposal networks. Adv Neur In 2015; 39(6): 91–99.

4. Dai, et al. R-FCN: object detection via region-based fully convolutional networks. In: Advances in neural information processing systems 29: Annual conference on neural information processing systems 2016 (eds Lee, Sugiyama, Luxburg, Guyon, Garnett), Barcelona, Spain, 5–10 December 2016, pp. 379–387. Cambridge, MA: MIT Press.

5. Redmon, Divvala, Girshick, et al. You only look once: unified, real-time object detection. In: 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016, pp. 779–788. Piscataway, NJ: IEEE.

6. Liu, Anguelov, Erhan, et al. SSD: single shot multi-box detector. In: European conference on computer vision (eds Leibe, Matas, Sebe, Welling), Amsterdam, The Netherlands, 11–14 October 2016, pp. 21–37. Berlin, Germany: Springer.

7. Redmon, Farhadi. YOLO9000: better, faster, stronger. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017 (eds Liu, Rehg, Taylor), Honolulu, HI, USA, 21–26 July 2017, pp. 6517–6525. Piscataway, NJ: IEEE.

8. Redmon, Farhadi. YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

9. Li, Yang. Improved YOLOv2 object detection model. In: 2018 6th International conference on multimedia computing and systems (ICMCS) (eds Essaaidi, Zaz), Rabat, Morocco, 10–12 May 2018, pp. 1–6. Piscataway, NJ: IEEE.

10. Lin, Dollár, Girshick, et al. Feature pyramid networks for object detection. In: IEEE conference on computer vision and pattern recognition CVPR 2017 (eds Liu, Rehg, Taylor), vol. 1, Honolulu, HI, USA, 21–26 July 2017, pp. 936–944. Piscataway, NJ: IEEE.

11. Zhang, Yang. Fast vehicle detection method based on improved YOLOv3. Com Eng App 2019; 55: 12–20.

12. Huang, Wang. DC-SPP-YOLO: dense connection and spatial pyramid pooling based YOLO for object detection. arXiv preprint arXiv:1903.08589, 2019.

13. He, Zhang, Ren, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE T Pattern Anal 2015; 37: 1904–1916.

14. Xu, Wang, Yang. Attention-YOLO: YOLO detection algorithm that introduces attention mechanism. Com Eng App 2019; 55(6): 13–23.

15. Adelson, Anderson, Bergen, et al. Pyramid methods in image processing. RCA Engineer 1984; 29: 33–41.

16. Dalal, Triggs. Histograms of oriented gradients for human detection. In: IEEE computer society conference on computer vision and pattern recognition CVPR 2005 (eds Schmid, Soatto, Tomasi), San Diego, CA, USA, 20–25 June 2005, pp. 4278–4284. Piscataway, NJ: IEEE.

17. Lowe. Distinctive image features from scale-invariant keypoints. Int J Comput Vis 2004; 60(2): 91–110.

18. Zhou. FSSD: feature fusion single shot multibox detector. arXiv preprint arXiv:1712.00960, 2017.

19. Liu, Qin, et al. Path aggregation network for instance segmentation. In: IEEE conference on computer vision and pattern recognition CVPR 2018 (eds Forsyth, Laptev, Ramanan, Oliva), Salt Lake City, Utah, USA, 18–22 June 2018, pp. 8759–8768. Piscataway, NJ: IEEE.

20. Liu, Wang. Receptive field block net for accurate and fast object detection. arXiv preprint arXiv:1711.07767, 2017.

21. Mehta, Rastegari, Caspi, et al. ESPNet: efficient spatial pyramid of dilated convolutions for semantic segmentation. In: European conference on computer vision ECCV (eds Ferrari, Hebert, Sminchisescu, Weiss), Munich, Germany, 8–14 September 2018, pp. 552–568. Berlin, Germany: Springer.

22. Chen, Wang, et al. Scale-aware trident networks for object detection. arXiv preprint arXiv:1901.01892, 2019.

23. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2016.

24. Everingham, Van Gool, Williams CKI, et al. The pascal visual object classes (VOC) challenge. Int J Comput Vis 2010; 88: 303–338.

25. Russakovsky, Deng, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015; 115: 211–252.

26. Liu, Ranga, et al. DSSD: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.

27. Zhou, Geng, et al. Scale-transferrable object detection. In: IEEE conference on computer vision and pattern recognition CVPR 2018 (eds Forsyth, Laptev, Ramanan, Oliva), Salt Lake City, Utah, USA, 18–22 June 2018, pp. 528–537. Piscataway, NJ: IEEE.

28. Zhang, Wen, Bian, et al. Single-shot refinement neural network for object detection. In: IEEE conference on computer vision and pattern recognition CVPR 2018 (eds Forsyth, Laptev, Ramanan, Oliva), Salt Lake City, Utah, USA, 18–22 June 2018, pp. 4203–4212. Piscataway, NJ: IEEE.