Separable reverse connected network for efficient multi-scale vehicle detection

Abstract

Vehicle detection is involved in a wide range of intelligent transportation and smart city applications, and the demand for fast and accurate detection of vehicles is increasing. In this article, we propose a convolutional neural network-based framework, called the separable reverse connected network, for multi-scale vehicle detection. In this network, the reverse connected structure enriches the semantic context information of previous layers, while separable convolution is introduced for sparse representation of the heavy feature maps generated from subnetworks. Further, we use a multi-scale training scheme, online hard example mining, and a model compression technique to accelerate the training process as well as reduce the parameters. Experimental results on Pascal Visual Object Classes (VOC) 2007 + 2012 and Microsoft Common Objects in Context (MS COCO) 2014 demonstrate that the proposed method yields state-of-the-art performance. Moreover, with separable convolution and model compression, the two-stage detector network is accelerated by about two times with little loss of detection accuracy.

Keywords

Vehicle detection, separable reverse connected network, model compression, convolutional neural network

Introduction

Vehicle detection is one of the essential tasks in surveillance systems, driver-assistance systems, and a wide range of intelligent transportation and smart city applications. With the rapid increase in the number of vehicles, the population, and the complexity of traffic environments, accurate and fast vehicle detection has become more challenging due to cluttered scenes, the variability of vehicle categories, attributes, and sizes, camera viewpoints, and lighting variation. For real applications, vision-based vehicle detection systems should be robust, fast, and accurate so as to deal with different situations. Many efforts have been devoted to designing methods and systems, yet the detection accuracy and speed are still unsatisfactory in real-world applications.

The goal of vehicle detection is to determine whether there are any vehicles in the given images and, if so, to return their spatial locations and some attributes, such as the category (motor, bus, or car). Traditional vehicle detectors were mainly designed with handcrafted features, such as HOG,¹ SIFT,² and Haar-like features,³ and sliding-window algorithms such as Selective Search⁴ and Edge Box⁵ were used to generate a large number of redundant windows. Due to the limited representation and discrimination ability of handcrafted features, such methods can hardly provide sufficient accuracy for detection in complex scenes.

In 2012, Krizhevsky et al.⁶ proposed a deep convolutional neural network (CNN) which achieved record-breaking image classification accuracy in the large scale recognition challenge. Deep learning allows computational models consisting of multiple hierarchical layers to learn highly complex, subtle, and abstract representations. With the rapid progress of CNNs in recent years, CNN-based vehicle detectors have also achieved remarkable performance and become dominant in detection tasks. Nevertheless, CNN and deep learning-based detectors still cannot meet the demands of real applications, particularly in the following respects.

Multiple category detection: There are many different categories of vehicles and each category has variable attributes (size, shape, and color). To simultaneously detect vehicles of multiple categories and attributes is challenging.

Multi-scale detection: Detecting vehicles at vastly different scales is challenging. Particularly, detecting vehicles of small size usually results in lower accuracy.

Speed of detection: Real-time vehicle detection is demanded in most applications, but CNN-based methods struggle to meet this demand due to the high complexity of CNN models.

Deep learning has driven significant progress in a broad range of problems, and recent years have seen big advances in the accuracy and speed of CNN-based object detection. CNN-based vehicle detectors can be categorized into two-stage and single-stage ones. A two-stage detector consists of a region proposal network (RPN) and a pipeline for bounding box regression and classification. The well-known R-CNN and its various extensions fall into this category. For example, Girshick et al.⁷ proposed a method that detects objects from region proposals generated by Selective Search and passed through CNN layers. The detection speed of R-CNN was limited by the need to generate a large number of proposals and to classify each of them. Then, a fully convolutional network (FCN) and a region of interest (RoI) pooling technique were applied by Girshick⁸ in the fast R-CNN framework. The milestone faster R-CNN framework⁹ introduced an RPN to generate proposals from convolutional features. The region-based FCN (R-FCN) uses position-sensitive feature maps for more precise region proposals¹⁰ and has achieved higher accuracy than R-CNN detectors. Deformable convolutional networks¹¹ introduced geometric transformations by learning additional offsets without supervision. Lin et al.¹² proposed a feature pyramid network (FPN) for multi-scale detection by exploiting the inherent multi-scale hierarchy of deep convolutional networks. A top-down architecture with lateral connections is used to merge feature maps in the pyramid, and state-of-the-art performance has been achieved by the FPN framework.

Single-stage detectors directly predict objects with modified backbones so as to achieve high detection speed. The single shot multi-box detector (SSD)¹³ uses boxes of different aspect ratios and scales on hierarchical feature maps for multi-scale object detection. The YOLO framework proposed by Redmon et al.¹⁴ employed a light feature extractor and simplified the detection task to a regression problem. It runs much faster than region-based networks, though its accuracy is not competitive. An extended version, YOLO v2,¹⁵ achieved a great improvement in both speed and accuracy.

Generally, two-stage detectors achieve higher accuracy, while single-stage detectors excel in speed. Improved detection accuracy or speed has been achieved by designing more powerful or compact network structures.¹⁶⁻²¹ Some works also combine traditional features or filters with neural networks to improve the performance of vehicle detection.²²⁻²⁴ Fusing feature maps of different layers has been shown to be beneficial for small object detection.²⁵,²⁶

In this article, we propose a CNN-based framework, called the separable reverse connected (SRC) network, for detecting multi-scale vehicles. In this framework, the reverse connected (RC) structure enriches the semantic context information of previous layers through a top-down pathway and lateral connections, while separable convolution (SC) is introduced for sparse representation of the heavy feature maps generated from subnetworks. Further, we use a multi-scale training scheme and online hard example mining (OHEM)²⁷ to make the training process more efficient for multi-scale vehicle detection, and we apply a model compression technique²⁸ to reduce the parameters of the networks so as to accelerate detection. Experimental results on Pascal Visual Object Classes (VOC) 2007+2012 and Microsoft Common Objects in Context (MS COCO) 2014 demonstrate that the proposed method achieves state-of-the-art performance. Moreover, with SC and model compression, the two-stage detector network is accelerated by about two times with little loss of detection accuracy.

Proposed method

The issues of multiple categories, multiple scales, and detection speed are hardly solved simultaneously by previous object detection methods, including two-stage and single-stage CNN-based ones. Some multi-scale feature fusion frameworks, such as the FPN¹² and the RC network,²⁵ perform better for multi-scale object detection. However, they also incur more parameters and higher computational complexity. Our proposed SRC network inherits the feature representation ability of the RC network, and we also introduce SC to reduce the parameters and computation.

Figure 1 shows the overall structure of the proposed vehicle detection method. The backbone network extracts hierarchical convolutional features, which are fused by SRC to take full advantage of semantic context information from deeper feature maps, and SC is introduced to reduce the network parameters and computational costs. Moreover, we use training optimization and model compression techniques to further improve the network performance.

Figure 1.

The overall structure of the proposed method.

Backbone network selection

As shown in Figure 1, the performance of the entire system is largely determined by the backbone network. To achieve a high-quality vehicle detector, we need to select an appropriate network structure for the task of vehicle detection.

Among existing two-stage detectors, faster R-CNN is the most popular framework,²⁹ consisting of a feature extractor and an RPN. R-FCN and FPN are networks that strengthen faster R-CNN in different aspects. ZF, VGG,³⁰ and ResNet³¹ are usually adopted as the feature extractors of these networks.

As for single-stage detectors, bounding boxes and class probabilities are predicted directly from full images. YOLO,¹⁴ YOLO v2,¹⁵ and SSD¹³ are reported to be efficient. Darknet and VGG are usually used in these frameworks. The network backbones of typical vehicle detectors are listed in Table 1.

Table 1.

Network backbones of existing vehicle detectors.

Approach	Network	Backbone
Two-stage	Faster R-CNN	ZF
	Faster R-CNN	VGG
	Faster R-CNN	ResNet101
	R-FCN	ResNet101
	Faster R-CNN + FPN	VGG
	Faster R-CNN + FPN	ResNet101
Single-stage	SSD 300	VGG
	SSD 500	VGG
	YOLO	GoogLeNet
	YOLOv2	Darknet

CNN: convolutional neural network; FCN: fully convolutional network; FPN: feature pyramid network; SSD: single shot multi-box detector.

A series of comparative experiments is conducted to evaluate the performance of existing detectors with different backbones. The benchmark databases Pascal VOC0712 and MS COCO are used in the experiments. The experimental results indicate that FPN based on faster R-CNN¹² achieves the best performance among two-stage detectors. For single-stage detectors, the overall performance of SSD500¹³ and that of YOLOv2¹⁵ are similar. Therefore, faster R-CNN and YOLO are selected as our base networks. Details of the implementation settings and experimental results are discussed in a subsequent section.

SRC network

Our purpose is to enhance a CNN-based framework’s feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. To this end, we propose an SRC network for efficient multi-scale vehicle detection.

First, SC is introduced in the SRC network to reduce parameters and to obtain a compact representation of feature maps. Recently, a series of inception-based neural networks have been proposed and are considered to be efficient.³²⁻³⁴ SC can be understood as an inception module with a maximally large number of towers. Chollet³⁵ claimed that replacing the inception module with SC improved the performance of the deep convolutional neural network in detection tasks, which suggests the idea of flexible building blocks with deeper convolutions. SC has been adopted in a faster R-CNN-based method¹⁹ and achieved a good trade-off between performance and computational speed through more efficient use of model parameters.

Usually, a convolution layer attempts to learn filters in a three-dimensional (3-D) space with two spatial dimensions and one channel dimension. Thus, a single convolution kernel is tasked with simultaneously mapping cross-channel correlations and spatial correlations. SC is usually implemented as a depth-wise spatial convolution followed by a point-wise convolution. Specifically, SC separates the original $k \times k$ convolution into a $k \times 1$ convolution followed by a $1 \times k$ convolution, so the number of parameters scales with $2k$ rather than $k^{2}$. Additionally, the size of the feature map can be controlled by $C_{mid}$ and $C_{out}$, as shown in Figure 2.

Figure 2.

Illustration of separable convolution. SC: separable convolution.
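To make the parameter saving concrete, the following minimal sketch counts the weights of a standard $k \times k$ convolution and of the $k \times 1$ followed by $1 \times k$ factorization described above. The function names and the channel sizes are illustrative assumptions, not values taken from the paper, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a single k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_mid, c_out):
    """Weights in a k x 1 convolution (c_in -> c_mid) followed by
    a 1 x k convolution (c_mid -> c_out), bias ignored."""
    return k * c_in * c_mid + k * c_mid * c_out


# Hypothetical sizes chosen only for illustration.
k, c_in, c_mid, c_out = 3, 256, 256, 256
full = standard_conv_params(k, c_in, c_out)
separable = separable_conv_params(k, c_in, c_mid, c_out)
ratio = separable / full  # equals 2k / k^2 when all channel widths match
```

With equal channel widths the ratio is $2k / k^{2} = 2/k$, so the saving grows with the kernel size; choosing a smaller $C_{mid}$ reduces the count further.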

In this way, SC not only provides a sparse representation of features but also makes the network efficient. To obtain more compact feature maps, we attach two normal convolution layers to the SC layer. The resulting feature maps are used to build feature pyramids in the subsequent RC network.

Multi-scale detection has been a key problem of vehicle detection in real applications. It has been shown that multi-scale representation can significantly improve the detection performance for objects of various scales.²⁹ Featurized image pyramids were widely used in the era of handcrafted features and proved to be critical for achieving good results.

For CNN-based methods, the features computed by a deep convolutional network are inherently multi-scale and pyramidal in shape, and the resulting feature maps have different spatial resolutions. However, large semantic gaps caused by the different depths are inevitably incurred. The upper layers have larger receptive fields and strong semantics, while the lower layers have smaller receptive fields and rich geometric details. Clearly, it is not optimal to predict objects of different scales using features from one layer. The method proposed in Kong et al.²⁵ implemented multi-scale detection by detecting objects of different sizes on their corresponding network scales and achieved good performance.

Recently, combining the fine-grained, high-resolution features of earlier layers with the feature maps of deeper layers has been applied in several multi-scale detection tasks. One of the representative works is FPN,¹² which uses a top-down pathway and lateral connections.

Inspired by the success of these previous works, we introduce an RC network to create a feature pyramid across all scales based on the CNN feature hierarchy. The reverse connection adopted in the RC network enables the network to detect vehicles on multiple levels of the CNN. The proposed RC network has an architecture similar to that of FPN.¹² However, FPN simply merges the feature maps by 2× up-sampling, whereas in our network, a deconvolution layer is used to connect with deep feature maps so as to enrich the semantic context information of previous layers. The proposed RC network acts as a feature pyramid in which predictions are made independently on each level, which makes multi-scale vehicle detection more precise. The structure of the proposed network is shown in Figure 3.

Figure 3.

The structure of FPN and the proposed RCN. FPN: feature pyramid network; RCN: Reverse Connected network.
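The difference between fixed 2× up-sampling and a learnable deconvolution can be illustrated with a small pure-Python sketch on 2-D lists. The 2 × 2 kernel, the stride of 2, and the element-wise sum used for merging are assumptions made for illustration only; the section does not specify the layer settings used by the authors.

```python
def upsample_nearest_2x(x):
    """2x nearest-neighbour up-sampling of a 2-D list (fixed, no parameters)."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def deconv_2x(x, w):
    """Stride-2 transposed convolution with a 2 x 2 kernel w.

    Each input value x[i][j] is spread over a 2 x 2 output patch,
    scaled by the kernel weights, which would be learned in training.
    """
    h, wd = len(x), len(x[0])
    out = [[0.0] * (2 * wd) for _ in range(2 * h)]
    for i in range(h):
        for j in range(wd):
            for a in range(2):
                for b in range(2):
                    out[2 * i + a][2 * j + b] += x[i][j] * w[a][b]
    return out


def merge(lateral, top_down):
    """Element-wise sum of a lateral map and an up-scaled deeper map."""
    return [[l + t for l, t in zip(lr, tr)] for lr, tr in zip(lateral, top_down)]


deep = [[1.0, 2.0], [3.0, 4.0]]          # coarse map from a deeper layer
shallow = [[0.5] * 4 for _ in range(4)]  # finer map from an earlier layer

fixed_merge = merge(shallow, upsample_nearest_2x(deep))
learned_merge = merge(shallow, deconv_2x(deep, [[1.0, 1.0], [1.0, 1.0]]))
```

With an all-ones kernel the transposed convolution reproduces nearest-neighbour up-sampling exactly, so the fixed scheme is a special case of the learnable one; any other kernel lets the network decide how deeper semantics are distributed over the finer grid.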

By integrating SC and reverse connection, the resulting SRC network provides a novel solution for feature map fusion and multi-scale detection while saving computation cost by reducing network parameters. The SRC network performs multi-scale vehicle detection on the corresponding network scales, which is more accurate and easier to optimize. Moreover, the SRC network can be trained end-to-end with all scales and is used consistently at training and test time, which is memory friendly.

Figure 4 gives the detailed pipeline of the proposed vehicle detection system. Specifically, after a single-scale image of arbitrary size is fed into the system, the backbone network extracts hierarchical convolutional features, which are fused by SRC to take full advantage of semantic context information from deeper feature maps. Vehicles are predicted at multiple scales using the rich semantic context information derived from the RC network. The top layer predicts large vehicles (such as buses and trucks), while the middle layer and the bottom layer predict medium-size vehicles (most cars) and small vehicles (most motors and some cars), respectively.

Figure 4.

Pipeline of the proposed vehicle detector.

The proposed SRC network is actually a generic feature extractor that creates feature pyramids with semantic context information. Advances in optimization techniques, such as hard example mining, can be naturally employed. Moreover, as an independent structure, it can be easily applied inside deep convolutional networks. The comparative experiments show that faster R-CNN-based methods³¹ achieve competitive performance among two-stage detectors, while YOLOv2¹⁵ is the more promising single-stage detector. Therefore, the proposed SRC network is combined with faster R-CNN and YOLO, respectively, with the aim of achieving more efficient two-stage and single-stage detectors.

Optimization

The field of object detection has made significant advances riding on the wave of deep CNN. For further improvements, network training optimization, model refining, and multi-scale training are still necessary.

For the vehicle detection problem in real-world scenes, negative examples (non-vehicle) dominate, which unbalances the positive and negative training examples. Besides, hard examples appear much more often in vehicle detection problems due to the cluttered scenes in real applications. All this makes network training inefficient, since training deep convolutional network detectors typically requires hundreds of thousands of stochastic gradient descent (SGD) steps. To address these problems, Shrivastava et al.²⁷ proposed a method named OHEM, in which the loss values of all region proposals in an input image are calculated by the current classifier and only the examples with the largest losses are picked for a minibatch. By adopting this scheme, OHEM can improve training efficiency as well as accuracy compared with heuristic methods.

In this article, we apply the OHEM algorithm by modifying the loss layers to implement hard example selection. For the RoIs generated by the RPN, the loss layers compute the loss of each RoI, sort the RoIs by loss to find the hard examples, and set the loss of non-hard RoIs to 0. In this way, OHEM leads to faster and more stable network training.
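The selection step can be sketched as follows: per-RoI losses are ranked, the largest ones are kept, and the rest are masked to zero. This is a simplified illustration under stated assumptions, not the authors' loss layer; the function names, the example loss values, and the batch size are hypothetical, and steps such as duplicate suppression among overlapping RoIs are omitted.

```python
def select_hard_examples(losses, batch_size):
    """Return a 0/1 mask keeping the batch_size RoIs with the largest loss."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    keep = set(order[:batch_size])
    return [1.0 if i in keep else 0.0 for i in range(len(losses))]


def masked_loss(losses, mask):
    """Mean loss over the selected RoIs; non-hard RoIs contribute 0."""
    kept = sum(mask)
    if not kept:
        return 0.0
    return sum(l * m for l, m in zip(losses, mask)) / kept


# Hypothetical per-RoI losses computed by the current network.
roi_losses = [0.05, 2.30, 0.10, 1.70, 0.02, 0.90]
mask = select_hard_examples(roi_losses, batch_size=2)
loss = masked_loss(roi_losses, mask)
```

Only the two highest-loss RoIs contribute to the gradient here, so easy negatives, which dominate the proposals, no longer dilute the update.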

From the perspective of practical application, two-stage detectors often lack efficiency due to their expensive computation and intensive memory use. Model compression can be a potential solution. In our work, parameter pruning²⁸ is applied to reduce complexity and alleviate over-fitting. After the connectivity is learned through network training, the weak connections whose weights fall below a certain threshold are removed from the network. Then we retrain the network to learn the final weights for the remaining sparse connections. The number of parameters of the original network can be reduced by about 10 times after several rounds of this procedure.
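A minimal sketch of threshold-based pruning is given below, assuming a flat list of weights and a hand-picked threshold; both the values and the function name are illustrative, and the retraining step is only indicated in a comment.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out connections whose absolute weight is below threshold.

    Returns the pruned weights and a 0/1 mask. During retraining the mask
    would be re-applied after each update so removed connections stay at zero.
    """
    mask = [1 if abs(w) >= threshold else 0 for w in weights]
    pruned = [w * m for w, m in zip(weights, mask)]
    return pruned, mask


# Hypothetical weights of one layer after ordinary training.
weights = [0.80, -0.03, 0.01, -0.55, 0.02, 0.40, -0.04, 0.005, 0.09, -0.07]
pruned, mask = prune_by_magnitude(weights, threshold=0.1)
remaining = sum(mask)
compression = len(weights) / remaining
```

In this toy case three of ten connections survive; the reduction reached in practice depends on the threshold and on how many prune and retrain rounds are run.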

Quantization and weight sharing²⁸ are also introduced to further compress the pruned network by reducing the number of bits used to represent each weight. Specifically, all the weights quantized into the same bin share the same value, so the stored parameters are replaced by small indexes. The original 32-bit weights can then be replaced with 2-bit indexes into a codebook, each pointing to the centroid of the weight's cluster. Thus, multiple weights share the same value. The weights that fall into the same cluster are modified together and fine-tuned to approximate the weights of the original network.
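The codebook idea can be illustrated with a small 1-D k-means sketch using four centroids, which corresponds to 2-bit indexes. The weight values, the linear initialization of centroids, and the iteration count are assumptions for illustration; the fine-tuning of shared values is not shown.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain 1-D k-means; returns the centroids and each value's cluster index."""
    centroids = list(centroids)

    def nearest(v):
        return min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))

    for _ in range(iterations):
        assign = [nearest(v) for v in values]
        for c in range(len(centroids)):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, [nearest(v) for v in values]


# Hypothetical weights that survived pruning.
weights = [-1.02, -0.98, -0.51, -0.49, 0.48, 0.52, 0.97, 1.03]
lo, hi = min(weights), max(weights)
init = [lo + i * (hi - lo) / 3 for i in range(4)]  # 4 centroids -> 2-bit index

codebook, indices = kmeans_1d(weights, init)
restored = [codebook[i] for i in indices]          # shared values used at inference

bits_before = 32 * len(weights)
bits_after = 2 * len(weights) + 32 * len(codebook)  # indexes plus codebook
```

The codebook overhead is fixed, so the saving approaches 32/2 = 16 times as the number of weights in a layer grows.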

To accomplish the goal of multi-scale vehicle detection, we employ the multi-scale training scheme proposed in the literature¹⁵,³⁶ to make the network more robust. In brief, only a portion of the images in the data set are selected, and the selected images are resized to scales ranging from 256 × 256 to 608 × 608. Finally, some of the resized images are chosen randomly for training.
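One way to realize this sampling is sketched below. The range 256 to 608 comes from the text, while the step of 32 pixels is an assumption borrowed from the YOLOv2 convention of using multiples of its down-sampling factor; the function names and the fixed seed are illustrative.

```python
import random


def candidate_sizes(min_size=256, max_size=608, stride=32):
    """Square input sizes between min_size and max_size, in steps of stride."""
    return list(range(min_size, max_size + 1, stride))


def sample_training_size(rng, sizes):
    """Pick one input size at random for the next batch of resized images."""
    return rng.choice(sizes)


sizes = candidate_sizes()
rng = random.Random(0)  # fixed seed only so the sketch is reproducible
schedule = [sample_training_size(rng, sizes) for _ in range(5)]
```

Each entry of `schedule` would be the side length to which the selected images are resized before the corresponding training step.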

Experimental results and discussion

Data sets

In this section, we evaluate the existing typical detectors as well as the proposed method. The Pascal VOC 0712 and MS COCO data sets are widely used in object detection evaluation.³⁷ Pascal VOC 0712 covers 20 categories close to real-world applications, with many difficult samples, and it set the precedent for standardized evaluation in object detection competitions.

The MS COCO data set consists of images of complex everyday scenes, which contain more object instances and richer annotation information. It has become the most widely used data set for generic object detection.

Both Pascal VOC 0712 and MS COCO contain multi-scale vehicles in natural environments across multiple categories, such as car, bus, truck, motor and person. Therefore, the two data sets are adopted in our evaluation experiments.

Pascal VOC 07 trainval and Pascal VOC 12 trainval are used for training, while the Pascal VOC 07 test set is used for evaluation. MS COCO provides an 80k train set and a 40k validation set. Following common practice, we further split the 40k validation set into a 35k val-minusmini data set and a 5k mini-validation data set. The 35k val-minusmini data set is combined with the 80k train set to form a total of 115k training samples.

Evaluation criteria

Generally, there are three criteria for evaluating the performance of detection algorithms: detection speed (frames per second (FPS)), precision, and recall. The most commonly used metric is average precision (AP), derived from precision and recall.¹²

Since there are multi-category objects in detection, AP is often computed for each category of object separately and mean AP (mAP) averaged over all object categories is adopted as the final measure of performance.

The definitions of the criteria adopted for different data sets are slightly different. Pascal-style AP takes a single IoU threshold of 0.5. For the MS COCO data set, $AP_{s}$, $AP_{m}$, and $AP_{l}$ are commonly adopted, which denote the average precision (AP) for small, medium, and large objects, respectively. $AP_{COCO}$ denotes the mAP calculated with the standard COCO metric (mean average precision over IoU thresholds from 0.5 to 0.95).
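The IoU test that underlies both conventions can be sketched as follows. Boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates; the example boxes are hypothetical, and the COCO thresholds use the standard step of 0.05 between 0.5 and 0.95.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


PASCAL_THRESHOLDS = [0.5]
COCO_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95

prediction = (10, 10, 50, 50)
ground_truth = (20, 20, 60, 60)
overlap = iou(prediction, ground_truth)
matches_pascal = overlap >= PASCAL_THRESHOLDS[0]
```

A detection counts as correct at a given threshold only if its IoU with a ground-truth box reaches that threshold; the COCO-style metric averages AP over all ten thresholds, so it rewards tighter localization than the Pascal-style metric.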

As noted before, the proposed method can be utilized for generic object detection, but we focus on vehicle detection in this article. Therefore, we compute the AP for each vehicle category (such as motor, car, and bus) and the overall average mAP for the Pascal VOC 0712 data set. Following common practice, $AP_{sVehicle}$, $AP_{mVehicle}$, $AP_{lVehicle}$, and the overall average mAP are calculated for the MS COCO data set.

Implementation details

Our detectors are trained end-to-end on a GTX 1080 GPU using SGD with a weight decay of 0.0001 and a momentum of 0.9. The learning rate is set to 0.001 at the beginning of the training process and then decreased by a factor of 0.1 every 50k iterations after 90k iterations. The batch size is set to 16, a value chosen experimentally.
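The learning rate schedule can be written as a small step function. The text leaves open whether the first decay happens at 90k iterations or 50k iterations later; the sketch below assumes the first drop occurs at 90k and further drops follow every 50k iterations. The function name and parameter names are illustrative.

```python
def learning_rate(iteration, base_lr=0.001, gamma=0.1,
                  first_drop=90_000, step=50_000):
    """Step schedule: constant until first_drop, then multiplied by gamma
    at first_drop and again every step iterations (assumed reading)."""
    if iteration < first_drop:
        return base_lr
    drops = 1 + (iteration - first_drop) // step
    return base_lr * gamma ** drops


lr_start = learning_rate(0)
lr_after_first_drop = learning_rate(90_000)
lr_after_second_drop = learning_rate(140_000)
```

Under the alternative reading, `first_drop` would simply be set to 140,000.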

Experimental results

For comparison of different vehicle detectors based on CNNs, we conduct a series of comparative experiments for evaluating the performances of existing frameworks as well as the proposed SRC method using the same image data sets and training schemes.

As for two-stage detectors, we first run experiments to investigate the existing typical networks with different backbones and find that methods with a ResNet101 backbone perform better than those with ZF and VGG, while ZF incurs lower computation costs due to fewer parameters. Both R-FCN¹⁰ and FPN¹² achieve good performance by modifying faster R-CNN in terms of region proposal and feature representation. Taking advantage of feature pyramids, the FPN method obtains the best results, especially on smaller vehicle detection.

Based on the above observations, we combine the proposed SRC network with faster R-CNN of ResNet101 to build a two-stage detector. As the SRC network consists of SC and RC network, experiments are conducted separately to explore the impact of each part on detection performance, which are denoted as faster R-CNN+SC, faster R-CNN+RC, and faster R-CNN+SRC, respectively. The overall experimental results are shown in Tables 2 and 3.

Table 2.

Evaluation results of two-stage detectors on Pascal VOC testset.^a

Approach	Network	Backbone	Motor	Car	Bus	mAP	FPS
Two-stage	Faster R-CNN	ZF	69.1	73.9	68.4	69.1	18
	Faster R-CNN	VGG	77.5	69.9	85.8	78.8	7
	Faster R-CNN	ResNet101	81.1	79.6	83.1	80.6	5
	R-FCN	ResNet101	82.3	78.5	86.0	83.3	3.3
	Faster R-CNN + FPN	VGG	81.7	85.7	83.3	81.7	4.4
	Faster R-CNN + FPN	ResNet101	83.4	87.1	85.5	83.8	4.0
	Faster R-CNN + SC	ResNet101	79.1	77.4	80.9	78.9	12.2
	Faster R-CNN + RC	ResNet101	88.3	90.8	87.8	88.4	3.2
	Faster R-CNN + SRC	ResNet101	87.9	89.4	87.0	87.0	10.6

CNN: convolutional neural network; FCN: fully convolutional network; FPN: feature pyramid network; mAP: mean average precision; FPS: frames per second; SC: separable convolution; RC: reverse connected; SRC: separable reverse connected.

^aThe bold values emphasize the best results achieved.

Table 3.

Evaluation results of two-stage detectors on MS COCO testset.^a

Approach	Network	Backbone	AP_sVeh	AP_mVeh	AP_lVeh	mAP	AP_COCO	FPS
Two-stage	Faster R-CNN	ZF	5.7	24.9	34.2	25.3	20.1	18
	Faster R-CNN	VGG	7.7	26.4	37.1	28.7	24.2	7
	Faster R-CNN	ResNet101	7.8	27.3	37.9	31.4	24.9	5
	R-FCN	ResNet101	7.2	28.1	39.0	33.8	25.3	3.3
	Faster R-CNN + FPN	VGG	10.9	26.9	38.5	32.3	24.9	4.4
	Faster R-CNN + FPN	ResNet101	11.1	29.5	38.7	34.6	26.2	4.0
	Faster R-CNN + SC	ResNet101	7.6	27.0	36.8	30.3	24.1	12.2
	Faster R-CNN + RC	ResNet101	16.7	34.1	44.1	39.8	30.6	3.2
	Faster R-CNN + SRC	ResNet101	15.1	32.4	42.3	38.5	28.7	10.6

CNN: convolutional neural network; FCN: fully convolutional network; FPN: feature pyramid network; AP: average precision; mAP: mean average precision; FPS: frames per second; SC: separable convolution; RC: reverse connected; SRC: separable reverse connected.

^aThe bold values emphasize the best results achieved.

From the results, we can see that after combining with SC, the detection speed of the faster R-CNN network is increased by more than two times with a slight loss of accuracy, which demonstrates the ability of SC to reduce computation cost. The RC network significantly improves the detection performance of faster R-CNN, while the detection speeds are almost the same. The improvement results from the RC structure, which enriches the semantic context information of previous layers. Further, the proposed SRC combined with faster R-CNN achieves the best balance of detection accuracy and speed: in Tables 2 and 3, its accuracy is close to that of faster R-CNN + RC, while its speed is close to that of faster R-CNN + SC. Specifically, the proposed SRC detector increases mAP by six points on Pascal VOC and seven points on COCO data sets. APs of small vehicle detection are higher than those of faster R-CNN by six points on Pascal VOC and seven points on COCO data sets, respectively. The solid and consistent detection improvements demonstrate the effectiveness of the proposed SRC network. Some detection examples are shown in Figures 5 and 6.
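The speed-up and accuracy gains quoted above can be recomputed from the ResNet101 rows of Tables 2 and 3. The short sketch below copies those table entries into dictionaries (the variable names are illustrative) and derives the ratios and differences.

```python
# Values copied from the ResNet101 rows of Tables 2 and 3.
pascal = {
    "Faster R-CNN":       {"mAP": 80.6, "FPS": 5.0},
    "Faster R-CNN + SC":  {"mAP": 78.9, "FPS": 12.2},
    "Faster R-CNN + SRC": {"mAP": 87.0, "FPS": 10.6},
}
coco = {
    "Faster R-CNN":       {"AP_s": 7.8, "mAP": 31.4},
    "Faster R-CNN + SRC": {"AP_s": 15.1, "mAP": 38.5},
}

base = pascal["Faster R-CNN"]
sc_speedup = round(pascal["Faster R-CNN + SC"]["FPS"] / base["FPS"], 2)
src_speedup = round(pascal["Faster R-CNN + SRC"]["FPS"] / base["FPS"], 2)
pascal_map_gain = round(pascal["Faster R-CNN + SRC"]["mAP"] - base["mAP"], 1)
coco_map_gain = round(coco["Faster R-CNN + SRC"]["mAP"] - coco["Faster R-CNN"]["mAP"], 1)
coco_small_gain = round(coco["Faster R-CNN + SRC"]["AP_s"] - coco["Faster R-CNN"]["AP_s"], 1)
```

These figures give a speed-up of about 2.4 times for SC alone and about 2.1 times for SRC, with mAP gains of 6.4 points on Pascal VOC and 7.1 points on COCO, and a gain of 7.3 points in small-vehicle AP on COCO.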

Figure 5.

Detection results of faster R-CNN + SRC on Pascal VOC testset. CNN: convolutional neural network.

Figure 6.

Detection results of faster R-CNN + SRC on COCO testset. CNN: convolutional neural network; SRC: separable reverse connected.

As for typical single-stage detectors, SSD, YOLO, and YOLOv2 with different backbones are investigated. It is shown that YOLOv2 with Darknet achieves the best detection accuracy on the Pascal data set and is slightly worse than SSD500 on the COCO data set, while the detection speed of YOLOv2 is almost twice that of the other single-stage detectors. Consequently, we combine the proposed SRC network with YOLOv2.

Tables 4 and 5 give the experimental results of single-stage detectors. The SRC method improves on the best previous results by five points and two points on the Pascal and COCO data sets, respectively. On the COCO data set, the proposed method increases the AP of small vehicle detection by five points (13.8 vs. 8.7), mostly because of the powerful feature representation ability of SRC. The proposed SRC single-stage detector largely increases detection accuracy, while its detection speed remains competitive. It can also be seen that the performance on car and bus is improved more than that on motor. A likely reason is that SC is mainly designed to reduce parameters so as to save computation costs. Some detailed features contained in small objects, such as motors, may be neglected, while those contained in larger objects survive. Therefore, although the performance on motor is improved by the SRC, the improvement is smaller than that on car and bus.

Table 4.

Evaluation results of single-stage detectors on Pascal VOC testset.^a

Approach	Network	Backbone	Motor	Car	Bus	mAP	FPS
Single-stage	SSD 300	VGG	80.6	80.8	81.1	79.3	46
	SSD 500	VGG	80.9	85.6	84.9	82.4	19
	YOLO	GoogLeNet	71.3	55.9	68.3	64.6	45
	YOLOv2	Darknet	83.4	76.5	79.8	80.3	81
	YOLOv2 + SC	Darknet	79.0	75.5	77.6	78.4	88
	YOLOv2 + RC	Darknet	84.9	85.6	87.0	85.8	18
	YOLOv2 + SRC	Darknet	83.8	84.8	85.3	84.5	34

mAP: mean average precision; FPS: frames per second; SSD: single shot multi-box detector; SC: separable convolution; RC: reverse connected; SRC: separable reverse connected.

^aThe bold values emphasize the best results achieved.

Table 5.

Evaluation results of single-stage detectors on MS COCO testset.^a

Approach	Network	Backbone	AP_sVeh	AP_mVeh	AP_lVeh	mAP	AP_COCO	FPS
Single-stage	SSD 300	VGG	5.3	23.2	39.6	25.1	23.2	46
	SSD 500	VGG	8.0	28.9	40.6	29.8	23.7	19
	YOLO	GoogLeNet	4.3	19.3	31.2	23.3	19.8	45
	YOLOv2	Darknet	8.7	22.4	35.5	28.2	23.4	81
	YOLOv2 + SC	Darknet	7.0	21.2	32.8	25.6	21.6	88
	YOLOv2 + RC	Darknet	14.3	34.0	43.9	36.6	28.7	18
	YOLOv2 + SRC	Darknet	13.8	33.7	41.5	34.8	25.7	34

AP: average precision; mAP: mean average precision; FPS: frames per second; SSD: single shot multi-box detector; SC: separable convolution; RC: reverse connected; SRC: separable reverse connected.

^aThe bold values emphasize the best results achieved.

As an independent network, the proposed SRC network can be easily combined with either a two-stage or a single-stage detector. The resulting two-stage detector achieves higher detection accuracy with higher speed, while the resulting single-stage one significantly improves detection accuracy with competitive speed. Some detection examples of single-stage YOLOv2 + SRC are shown in Figures 7 and 8.

Figure 7.

Detection results of YOLOv2 + SRC on Pascal VOC testset. SRC: separable reverse connected.

Figure 8.

Detection results of YOLOv2 + SRC on COCO testset. SRC: separable reverse connected.

As described in the previous section, optimization methods, such as multi-scale training, OHEM,²⁷ and model compression,²⁸ are applied in our method. To clarify the effectiveness of each optimization technique, we run several experiments on the Pascal and COCO data sets. FSRC denotes the original version of the proposed SRC method, FSRC+ denotes the original version with multi-scale training, and FSRC++ denotes the original version with multi-scale training and OHEM. Besides, FSRC++A and FSRC++B denote the application of model compression with different intensities.

The experimental results for optimization of the two-stage detector are presented in Tables 6 and 7. It can be seen that the optimized SRC networks improve the detection accuracy of the original SRC method by about 1% mAP of vehicles on Pascal VOC0712 and about 2% mAP of vehicles on MS COCO. The application of model compression significantly increases detection speed, which makes the proposed method capable of real-time vehicle detection.

Table 6.

Multi-scale training, OHEM algorithm, and model compressing applied to the SRC of two-stage detector on Pascal VOC testset.^a

Network	Backbone	Car	Motor	Bus	mAP	FPS
FSRC	ResNet101	86.7	83.9	85.2	84.0	10.6
FSRC+	ResNet101	88.1	85.4	85.9	85.2	10.6
FSRC++	ResNet101	89.4	87.9	87.0	87.1	10.6
FSRC++: A	ResNet101	88.9	87.1	86.8	86.7	28
FSRC++: B	ResNet101	88.0	86.9	86.4	86.0	35

OHEM: online hard example mining; SRC: separable reverse connected; mAP: mean average precision; FPS: frames per second.

^a A and B indicate different intensities of model compression. The bold values emphasize the best results achieved.

Table 7.

Multi-scale training, OHEM, and model compression applied to the two-stage SRC detector on the MS COCO test set.^a

Network	Backbone	AP _sV	AP _mV	AP _lV	mAP	FPS
FSRC	ResNet101	13.2	29.6	38.4	34.9	10.6
FSRC+	ResNet101	13.5	30.1	39.6	36.8	10.6
FSRC++	ResNet101	15.1	32.4	42.3	38.5	10.6
FSRC++: A	ResNet101	14.9	31.8	42.0	38.2	28
FSRC++: B	ResNet101	14.1	30.2	41.4	37.7	35

OHEM: online hard example mining; SRC: separable reverse connected; AP: average precision; mAP: mean average precision; FPS: frames per second.

^a A and B indicate different intensities of model compression. The bold values emphasize the best results achieved.
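One way to picture what a compression "intensity" could mean is magnitude-based pruning, the first stage of the deep compression pipeline,²⁸ in which the smallest-magnitude weights of a layer are set to zero. The sketch below is an illustration under that assumption only: the function name, the toy weights, and the sparsity levels 0.25 and 0.50 are hypothetical and are not the settings behind rows A and B.

```python
from typing import List


def prune_by_magnitude(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


# Toy layer weights (hypothetical values).
layer = [0.8, -0.05, 0.3, 0.01, -0.6, 0.12, -0.02, 0.4]
light = prune_by_magnitude(layer, sparsity=0.25)  # milder, "A"-like setting
heavy = prune_by_magnitude(layer, sparsity=0.50)  # stronger, "B"-like setting
print(light)
print(heavy)
```

Zeroing individual weights reduces the number of effective parameters, but it yields a speed-up only when the runtime exploits the sparsity or when whole channels are removed, so the FPS figures in Tables 6 and 7 should not be inferred from the sparsity level alone.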

Single-stage detectors have the advantages of a simple structure and high detection speed, but their detection accuracy is unsatisfactory. In this case, only multi-scale training has been applied to the proposed single-stage detector, denoted as YSRC+. YSRC+: A and YSRC+: B denote YSRC+ with model compression applied at two different intensities. Tables 8 and 9 give the comparison results on Pascal VOC and MS COCO, respectively. Multi-scale training improves the detection accuracy of the single-stage SRC detector, raising mAP from 84.3 to 86.0 on Pascal VOC and from 31.7 to 34.8 on MS COCO. Applying model compression to the single-stage SRC detector doubles the detection speed at setting A (34 to 68 FPS) and nearly triples it at setting B (95 FPS), but detection accuracy decreases noticeably. Since the single-stage detector already has an advantage in speed, more effort should be put into improving its detection accuracy.
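As an illustration of multi-scale training for a single-stage detector, the sketch below follows the scheme published with YOLOv2,¹⁵ where the square input size is re-drawn every 10 batches from multiples of 32 between 320 and 608. Whether YSRC+ uses exactly these values is an assumption here; the function names, the default arguments, and the seed are hypothetical.

```python
import random
from typing import Iterator, List


def candidate_sizes(min_size: int = 320, max_size: int = 608, stride: int = 32) -> List[int]:
    """Square input sizes that are multiples of the network stride."""
    return list(range(min_size, max_size + 1, stride))


def size_schedule(num_batches: int, every: int = 10, seed: int = 0) -> Iterator[int]:
    """Yield one input size per batch, drawing a new size every `every` batches."""
    if every <= 0:
        raise ValueError("every must be positive")
    rng = random.Random(seed)
    sizes = candidate_sizes()
    current = sizes[-1]
    for batch in range(num_batches):
        if batch % every == 0:
            # Re-sample the resolution; all images in the next batches share it.
            current = rng.choice(sizes)
        yield current


# Input sizes for 30 consecutive batches (three resolution draws).
print(list(size_schedule(num_batches=30)))
```

Because the sizes are multiples of the stride, the same fully convolutional network can process every resolution without architectural changes; only the image resizing in the data loader depends on the drawn size.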

Table 8.

Multi-scale training and model compression applied to the single-stage SRC detector on the Pascal VOC test set.^a

Network	Backbone	Car	Motor	Bus	mAP	FPS
YSRC	Darknet	84.3	83.8	85.3	84.3	34
YSRC+	Darknet	84.7	85.1	87.6	86.0	34
YSRC+: A	Darknet	82.1	80.2	84.4	82.2	68
YSRC+: B	Darknet	79.4	77.8	82.4	79.9	95

SRC: separable reverse connected; mAP: mean average precision; FPS: frames per second.

^aThe bold values emphasize the best results achieved.

Table 9.

Multi-scale training and model compression applied to the single-stage SRC detector on the MS COCO test set.^a

Network	Backbone	AP _sV	AP _mV	AP _lV	mAP	FPS
YSRC	Darknet	11.3	28.6	40.9	31.7	34
YSRC+	Darknet	13.8	33.7	41.5	34.8	34
YSRC+: A	Darknet	10.6	26.4	39.1	29.4	68
YSRC+: B	Darknet	8.1	21.3	36.9	26.7	95

SRC: separable reverse connected; AP: average precision; mAP: mean average precision; FPS: frames per second.

^aThe bold values emphasize the best results achieved.
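The speed and accuracy trade-off of model compression can be read directly from Tables 6 to 9. The short Python sketch below recomputes, for each compressed model, the speed-up factor and the change in mAP relative to its uncompressed counterpart (FSRC++ or YSRC+). The numbers are copied from the tables; the helper names and the grouping labels are hypothetical.

```python
from typing import Dict, List, Tuple

Entry = Tuple[float, float]  # (mAP, FPS)

# Values copied from Tables 6-9. The first entry of each group is the
# uncompressed reference; the others are compression settings A and B.
RESULTS: Dict[str, List[Tuple[str, Entry]]] = {
    "two-stage / Pascal VOC": [
        ("FSRC++", (87.1, 10.6)), ("FSRC++: A", (86.7, 28.0)), ("FSRC++: B", (86.0, 35.0))],
    "two-stage / MS COCO": [
        ("FSRC++", (38.5, 10.6)), ("FSRC++: A", (38.2, 28.0)), ("FSRC++: B", (37.7, 35.0))],
    "single-stage / Pascal VOC": [
        ("YSRC+", (86.0, 34.0)), ("YSRC+: A", (82.2, 68.0)), ("YSRC+: B", (79.9, 95.0))],
    "single-stage / MS COCO": [
        ("YSRC+", (34.8, 34.0)), ("YSRC+: A", (29.4, 68.0)), ("YSRC+: B", (26.7, 95.0))],
}


def trade_off(reference: Entry, compressed: Entry) -> Tuple[float, float]:
    """Return (speed-up factor, mAP change in points) relative to the reference."""
    ref_map, ref_fps = reference
    cmp_map, cmp_fps = compressed
    return cmp_fps / ref_fps, cmp_map - ref_map


def summarize() -> List[Tuple[str, str, float, float]]:
    """One row per compressed model: (group, name, speed-up, mAP change)."""
    rows = []
    for group, entries in RESULTS.items():
        _, reference = entries[0]
        for name, entry in entries[1:]:
            speed_up, map_change = trade_off(reference, entry)
            rows.append((group, name, round(speed_up, 2), round(map_change, 1)))
    return rows


for group, name, speed_up, map_change in summarize():
    print(f"{group:26s} {name:10s} x{speed_up:.2f}  {map_change:+.1f} mAP")
```

With these table values, the two-stage detector runs roughly 2.6 to 3.3 times faster for a loss of about 1 mAP point or less, whereas the single-stage detector runs 2.0 to 2.8 times faster at a cost of 3.8 to 8.1 mAP points.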

Conclusions

We have proposed a CNN-based framework, called the SRC network, for multi-category and multi-scale vehicle detection. The RC structure enriches the semantic context information of previous layers, while SC is introduced for sparse representation of the heavy feature maps generated from subnetworks. Multi-scale training, OHEM, and model compression are applied to improve and accelerate detection. The proposed method shows significant improvements over several strong baselines. With SC and model compression, the two-stage detector is accelerated by about two times with little loss of detection accuracy. The resulting single-stage detector substantially increases detection accuracy, while its detection speed remains competitive. Finally, despite the strong feature representation ability achieved by SRC, it remains critical to explore solutions to multi-scale detection problems.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Linlin Huang

References

1. Dalal, Triggs. Histograms of oriented gradients for human detection. In: IEEE Computer Society conference on computer vision and pattern recognition, CVPR, San Diego, CA, USA, 20–25 June 2005, pp. 4278–4284. IEEE.

2. Lowe. Distinctive image features from scale-invariant keypoints. Int J Comput Vis 2004; 60(2): 91–110.

3. Lienhart, Maydt. An extended set of Haar-like features for rapid object detection. In: Proceedings of the 2002 international conference on image processing, ICIP 2002, vol. 901, Rochester, New York, USA, 22–25 September 2002, pp. I-900–I-903. IEEE.

4. Uijlings, Van De Sande, Gevers, et al. Selective search for object recognition. Int J Comput Vis 2013; 104(2): 154–171.

5. Zitnick, Dollár. Edge boxes: locating object proposals from edges. In: Computer vision – ECCV 2014 – 13th European conference (eds Fleet, Pajdla, Schiele, Tuytelaars), Zurich, Switzerland, 6–12 September 2014, pp. 391–405. Cham: Springer.

6. Krizhevsky, Sutskever, Hinton, et al. ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems 25: 26th Annual conference on neural information processing systems 2012 (eds Bartlett, Pereira FCN, Burges CJC, Bottou, Weinberger), Lake Tahoe, Nevada, USA, 3–6 December 2012, pp. 1097–1105. USA: Curran Associates Inc.

7. Girshick, Donahue, Darrell, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE conference on computer vision and pattern recognition, CVPR 2014, Columbus, OH, USA, 23–28 June 2014, pp. 580–587. IEEE.

8. Girshick. Fast R-CNN. In: IEEE conference on computer vision and pattern recognition, Massachusetts, Boston, USA, 8–10 June 2015. Piscataway, NJ: IEEE.

9. Ren, Girshick, et al. Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015 (eds Cortes, Lawrence, Lee, Sugiyama, Garnett), Montreal, Quebec, Canada, 7–12 December 2015, pp. 91–99.

10. Dai, et al. R-FCN: object detection via region-based fully convolutional networks. In: Advances in neural information processing systems 29: Annual conference on neural information processing systems 2016, Barcelona, Spain, 5–10 December 2016, pp. 379–387.

11. Dai, Xiong, et al. Deformable convolutional networks. CoRR, abs/1703.06211 2017; 1(2): 3.

12. Lin, Dollár, Girshick, et al. Feature pyramid networks for object detection. In: IEEE conference on computer vision and pattern recognition, CVPR 2017, vol. 1, Honolulu, HI, USA, 21–26 July 2017, pp. 936–944.

13. Liu, Anguelov, Erhan, et al. SSD: single shot multibox detector. In: Computer vision – ECCV 2016 – 14th European conference, Amsterdam, The Netherlands, 11–14 October 2016, pp. 21–37. Berlin, Germany: Springer.

14. Redmon, Divvala, Girshick, et al. You only look once: unified, real-time object detection. In: 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016, pp. 779–788. Piscataway, NJ: IEEE.

15. Redmon, Farhadi. YOLO9000: better, faster, stronger. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017, pp. 6517–6525. Piscataway, NJ: IEEE.

16. Gkioxari, Dollár, et al. Mask R-CNN. In: IEEE international conference on computer vision, ICCV 2017, Venice, Italy, 22–29 October 2017, pp. 2980–2988. Piscataway, NJ: IEEE.

17. Zhu, Wang, Dai, et al. Flow-guided feature aggregation for video object detection. In: IEEE international conference on computer vision, ICCV 2017, vol. 3, Venice, Italy, 22–29 October 2017, pp. 408–417.

18. Peng, Xiao, et al. MegDet: a large mini-batch object detector. arXiv preprint arXiv:1711.07240, 2017; 7.

19. Peng, et al. Light-head R-CNN: in defense of two-stage object detector. arXiv preprint arXiv:1711.07264, 2017.

20. Sandler, Howard, Zhu, et al. Inverted residuals and linear bottlenecks: mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381, 2018.

21. Huang, Liu, Van Der Maaten, et al. Densely connected convolutional networks. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, vol. 1, Honolulu, HI, USA, 21–26 July 2017, pp. 2261–2269. Piscataway, NJ: IEEE.

22. Yuan, Cao, Hao, et al. Vehicle detection by a context-aware multichannel feature pyramid. IEEE Trans Syst Man Cybern: Syst 2017; 47(7): 1348–1357.

23. Zhang, Gao, Xue, et al. Real-time vehicle detection and tracking using improved histogram of gradient features and Kalman filters. Int J Adv Robot Syst 2018; 15(1): 1729881417749949.

24. Pae, Choi, Kang, et al. Vehicle detection framework for challenging lighting driving environment based on feature fusion method using adaptive neuro-fuzzy inference system. Int J Adv Robot Syst 2018; 15(2): 1729881418770545.

25. Kong, Sun, Yao, et al. Ron: reverse connection with objectness prior networks for object detection. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, vol. 1, Honolulu, HI, USA, 21–26 July 2017, pp. 5244–5252. Piscataway, NJ: IEEE.

26. Liang, Wei, et al. Perceptual generative adversarial networks for small object detection. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017, pp. 1951–1959. Piscataway, NJ: IEEE.

27. Shrivastava, Gupta, Girshick. Training region-based object detectors with online hard example mining. In: 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016, pp. 761–769.

28. Han, Mao, Dally. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

29. Huang, Rathod, Sun, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, vol. 4, Honolulu, HI, USA, 21–26 July 2017. Piscataway, NJ: IEEE.

30. Simonyan, Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

31. Xie, Girshick, Dollár, et al. Aggregated residual transformations for deep neural networks. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017, pp. 5987–5995. Piscataway, NJ: IEEE.

32. Szegedy, Liu, Jia, et al. Going deeper with convolutions. In: IEEE conference on computer vision and pattern recognition, CVPR 2015, Boston, MA, USA, 7–12 June 2015, pp. 1–9.

33. Szegedy, Vanhoucke, Ioffe, et al. Rethinking the inception architecture for computer vision. In: 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016, pp. 2818–2826. Piscataway, NJ: IEEE.

34. Szegedy, Ioffe, Vanhoucke, et al. Inception-v4, inception-ResNet and the impact of residual connections on learning. In: Proceedings of the 31st AAAI conference on artificial intelligence (eds Singh SP, Markovitch), vol. 4, San Francisco, California, USA, 4–9 February 2017, pp. 4278–4284.

35. Chollet. Xception: deep learning with depthwise separable convolutions. arXiv preprint, 2017; 4: 4278–4284.

36. Singh, Najibi, Davis. SNIPER: efficient multi-scale training. In: Advances in neural information processing systems 31: Annual conference on neural information processing systems 2018, NIPS 2018 (eds Bengio, Wallach, Larochelle, Grauman, Cesa-Bianchi, Garnett), Montréal, Canada, 3–8 December 2018, pp. 9333–9343.

37. Liu, Ouyang, Wang, et al. Deep learning for generic object detection: a survey. arXiv preprint arXiv:1809.02165, 2018.