Sage Journals: Discover world-class research

Abstract

Many pieces of information are included in the front region of a vehicle, especially in windshield and bumper regions. Thus, windshield or bumper region detection is making sense to extract useful information. But the existing windshield and bumper detection methods based on traditional artificial features are not robust enough. Those features may become invalid in many real situations (e.g. occlude, illumination change, viewpoint change.). In this article, we propose a multi-attribute-guided vehicle discriminately region detection method based on convolutional neural network and not rely on bounding box regression. We separate the net into two branches, respectively, for identification (ID) and Model attributes training. Therefore, the feature spaces of different attributes become more independent. Additionally, we embed a self-attention block into our framework to improve the performance of local region detection. We train our model on PKU_VD data set which has a huge number of images inside. Furthermore, we labeled the handcrafted bounding boxes on 5000 randomly picked testing images, and 1020 of them are used for evaluation and 3980 as the training data for YOLOv3. We use Intersection over Union for quantitative evaluation. Experiments were conducted in three different latest convolutional neural network trunks to illustrate the detection performance of the proposed method. Simultaneously, in terms of quantitative evaluation, the performance of our method is close to YOLOv3 even without handcrafted bounding boxes.

Keywords

Vehicle local region detection multi-attribute self-attention

Introduction

With the development of automation and intelligence, intelligent transportation systems and security surveillance systems are widely used in various fields. In those systems, the information of vehicles is inevitably needed to process due to different requirements (e.g. vehicle model recognition,^1

–4 vehicle re-identification (Re-ID) task,^5
–7 license plate recognition,^8,9 occupant violations^10
–12). Psyllos et al.¹ utilize license plate recognition to locate the bumper region and then use a set of scale-invariant feature transform (SIFT) descriptors to describe the region information to achieve the model recognition. Fang et al.² process the inputs by a coarse-to-fine convolutional neural network (CNN) to obtain the discriminative parts. Specifically, they rely on global and local parts combining feature to vehicle model classification. Additionally, a one-versus-all support vector machine classifier is applied for classification. He et al.⁵ obtain the part features by a LocalNet (use YOLO in the experiment) to distinguish the subtle discrepancy. After that, part features concatenate with the global feature as a discriminative feature for the vehicle Re-ID task. Silva and Jung⁸ recognize the license plate by CR-NET. CR-NET is similar to YOLO but has better nonlinearity. Artan et al.¹⁰ consider 13 handcraft labeled landmark points to detect the windshield region through the model, which is based on a mixture of trees with a shared pool of parts. Three interesting regions are drawn according to their geometric relationship with the windshield region to detect the violations (e.g. front seat occupancy, driver cell phone usage, and seat belt violation detections) after the windshield localization.

Generally, the information in the windshield and bumper regions is enough for most vehicle tasks. Those regions contain abundant information (e.g. passengers, annual inspection marks, logo, and license plate), as shown in Figure 1. Our brain usually pays attention to the information which is personally interesting.¹³ This is also in accordance with the human visual system. Therefore, usually we do not have to pay the computing costs for global information. In most vehicle missions, we only need to use the abovementioned information to achieve the goal. For example, annual inspection marks or unique sundries can provide discrimination information in vehicle Re-ID and vehicle retrieval tasks. Additionally, we also have reasons to believe that the bumper region and logo are more helpful for vehicle model recognition. At the same time, there is no doubt that license plate recognition only relies on the license plate region which is in the bumper region. For the occupant violations detection system, we should pay more attention to the windshield region. The status of occupants can be displayed through the windshield to further recognize the violations.¹⁰

Figure 1.

For a vehicle, bumper and windshield regions include most information shown as the red dotted box in the figure. Furthermore, some fine-grain information also marked up by different color dotted boxes. Passenger and driver regions adopt occupant violations detection. Annual inspection marks and sundries regions adopt vehicle Re-ID and retrieval, logo and bumper regions adopt vehicle model recognition, and license plate region adopts license plate recognition. Re-ID: re-identification.

Therefore, a robust windshield and bumper regions detection method can bring positive effects for many tasks, and if it does not rely on handcraft bounding box ground-truth and locates regions simultaneously, then the method will become more general. Although many state-of-the-art works have proposed some methods for the windshield or bumper detection, there is a lack of methods that can detect both regions simultaneously. There are some methods based on the traditional features^11,14 to generate the bounding box, but due to environmental changes, their efficiency and robustness are insufficient. Yuan et al.¹¹ segment the image by mean-shift firstly. Then the traditional method, Hough transform, is used to extract the windshield region. Finally, the histograms of oriented gradients are used as the feature to detect occupants. These traditional methods are computationally expensive or not robust. Artan et al.¹⁰ combined traditional methods to generate windshield region bounding boxes and used manual annotations to generate windshield models. Similarly, Mountelos et al.¹⁵ propose a fully convolutional network to segment the windshield region, but the model has to fine annotate the data set. Psyllos et al.¹ and Lee et al.³ detect license plates first and then use the geometric relationship between the license plate region and the bumper region to figure out the bumper region to achieve the model recognition. These methods are sensitive to the position of the license plate and are easy to fail when the license plate is in a special position. Silva and Jung⁸ identify the bumper region and the license plate region by two passes architecture CNN to achieve bumper detection. The method is similar to YOLO. It also needs to label the bounding box for training in regression iteration processing. Although Fang et al.² generate the bumper region by the feature map-based location and also consider the multi-grain information, they only pay attention to the bumper region for vehicle model recognition. Neither traditional methods nor CNN-based methods focus on multi-attribute-based multi-region detection. For vehicle Re-ID and retrieval tasks, windshield region and bumper region can extract discrimination features for different attributes to enhance the identification accuracy of different objects. Furthermore, the two regions also provide most of the multi-attribute and multi-grain information. Moreover, if local regions can be detected preferentially, the redundant information will be shielded in the subsequent processing and the filtered information can be more efficient and accurate to describe the feature difference.

Therefore, we propose a dual-branch-based architecture to detect vehicle local regions. Moreover, we also embed a self-attention block to enhance the independence of different attributes.

The main contributions of our method are as follows:

We use a dual-branch CNN architecture adopt to the multi-attribute and train our model based on multi-attribute labels to obtain the attention map of different attributes in order to detect multi-local regions simultaneously.

We propose a positioning method based on multistage pooling, which makes our multi-region detection unnecessary to label annotations. In other words, our method becomes an adaptive detection guided by feature maps.

We embed the self-attention block into our framework to make multi-attribute features more independent. Furthermore, we also handcraft annotate 5000 randomly picked testing images; 1020 images for evaluation and the rest for YOLOv3 ¹⁶ training.

The experimental results show that our method achieves significant performance for multi-region detection on vehicles. In addition, the obtained Intersection over Union (IoU) values are close to that of YOLOv3.

The rest of the article is structured as follows: The second section details the proposed method in this article. The third section shows the experiments and compares the performance of our framework under different CNN architectures. Additionally, we compare ResNet-50-based framework with the handcraft annotation-based object method YOLOv3 and offer the result. The fourth section draws the conclusion.

Proposed method

Our method pays attention to the difference between feature maps guided by multi-attribute. Moreover, through our dual-branch architecture, we ensure the generated feature spaces have different regionalities. In other words, different attributes have different local sensitivity. Moreover, we reinforce the feature centralization by the embedded self-attention block and improve the performance. Then, the feature maps are generated from different attribute branches, and both feature maps are input into the proposed positioning method, so as to achieve adaptive acquisition of the bounding box. Furthermore, we pay attention to the preference regions of ID and Model label. The preferred regions for those labels are the bumper and windshield regions, both of which are invisible from the side, so we do experiments on PKU_VD ¹⁷ data set where all the images are front views. We will detail our method from four parts: dual-branch architecture, loss function, self-attention block, and the proposed position algorithm.

Dual-branch architecture

Currently, many state-of-the-art methods often make use of multi-attribute,^17

–20 but they lack attributes-guided multi-branch architecture. In multi-attribute tasks, a single-wire structure network can only generate one feature corresponding to each sample, but it must be associated with multi-attribute labels at the same time. For example, a sample and an anchor sample have the same model but different IDs. If we employ triplet losses in a single architecture network, the pair of features should be closer in model classification triplet loss and farther in ID classification triplet loss. The distance between the same pair of features cannot be increased and decreased at the same time. Inspired by some multi-branch architectures,^17,21 we propose a multi-attribute-guided dual-branch architecture. Different attributes have the same feature space under shallow, but they may have completely different feature spaces under depth. So we emphasize the transformation of the architecture. The features of different attributes become more independent and more consistent with their respective labels through the multi-branch architecture in our work. Hence, we separate our framework into dual-branch at the deep layer, as shown in Figure 2.

Figure 2.

The pipeline about the proposed method. Gray trapezoids denote CNN downsample blocks, gray rectangles denote self-attention blocks, blue rectangles denote mathematical operators, and brown rectangle denotes positioning block. CNN: convolutional neural network.

It is necessary to declare that the gray trapezoid denotes the downsample block, the gray rectangle denotes self-attention block, and the blue rectangle denotes mathematical operator. We compare the performance of different operators in the experiment. The brown rectangle denotes the proposed positioning method. Our method feeds the input to a parameter-sharing block and then uses the output of the parameter-sharing block as inputs to different attribute branches (ID and Model are used in our experiments.). After the corresponding self-attention blocks and operators, these features are converted into heatmaps. Finally, by processing the heatmaps through the proposed positioning block, detection boxes corresponding to different attributes can be obtained.

In the training stage, we leverage multi-attribute and aim to separate the features as much as possible, so we combine the triplet loss and cross-entropy loss as the cost function. Many works prove that triplet loss is suitable for obtaining independent features.^17,19,22,23 Additionally, we sort the input as $G = \{g_{1}, g_{2}, \dots, g_{n}\}$ to facilitate the use of triplet loss, where the images in the same g have the same attribute label.

Loss

Multi-attribute information is employed in our method. The Model and ID labels guide the training. Generally, there is no interference between different attributes, thus the contradiction caused by the interference is always ignored. In our case, Model and ID have interference. The triplet loss as part of the cost function contains the relationship among anchor sample, positive sample, and negative sample. It can make the features generated by the same attribute appear closer but the features generated by different attributes farther. When there exists an interference between different attributes, the conflict will appear. An example is shown in Figure 3. For instance, vehicles have the same Model attribute but different ID attributes. It means that although the positive samples have the same Model attribute with the anchor, they still may have different ID attributes with the anchor. In Model triplet loss, the distance from the anchor sample should be small, but in ID triplet loss the distance should be large. But an input image can only generate a corresponding feature through the CNN, so it will cause conflicts under multi-attributes. Therefore, we propose a dual-branch architecture to resist the conflict.

Figure 3.

The visualization of the conflict. The left image has the same ID with anchor image and the right image only has the same model with anchor image. In ID triplet loss, ℓ₁ tends to be smaller and ℓ₂ tends to be bigger. In Model triplet loss, ℓ₁ also has the same tendency but ℓ₂ tends to be smaller. ℓ₂ has different tendencies in different attribute triplet losses. That is unreasonable.

For classification, we employ the cross-entropy with a label smoothing regularizer²⁴ as the loss. We can formulate it as follows

{Loss}_{c} = - ((1 - ε) \times t + \frac{ε}{N}) \sum_{i = 1}^{N} \times log \frac{exp (p_{i})}{\sum_{n = 1}^{N} exp (p_{n})}

where ε is a smoothing parameter, t equals 1 if the label is the ground-truth, N is the number of classifications, and p_n denotes the logits or unnormalized log probabilities of the nth classification for the sample.

For triplet loss, we divide the input into three groups: $g^{a}, g^{p}, and g^{n}$ , which denote the anchor image, positive group, and negative group, respectively. The triplet loss is inspired by Hermans et al.²⁵ and is defined as

{Loss}_{t} =max (0, - (d_{min}^{n} - d_{max}^{p}) + margin)

where $d_{min}^{n}$ denotes the minimum value of dⁿ , $d_{max}^{p}$ denotes the maximum value of d^p , and dⁿ and d^p are the distances calculated according to the following formulas

\begin{array}{l} d^{n} = ∥ f (g^{a}) - f (g^{n}) ∥ \\ d^{p} = ∥ f (g^{a}) - f (g^{p}) ∥ \end{array}

where $f (*)$ denotes the feature is generated from CNN. Finally, the overall loss can be formulated as

Loss = {Loss}_{c} + ω \times {Loss}_{t}

where ω denotes the weight coefficient.

Self-attention block

Inspired by some state-of-the-art works about self-attention in computer vision,^26
–28 we embed the self-attention block²⁹ into our framework. The framework about the self-attention block is shown in Figure 4. The attention mechanism originates from the human visual system. Humans often focus on the fine-grained local information of the fuzzy space range through the coarse-grained prior information of the global image. We employ the self-attention to calculate the weight of each feature by training.

Figure 4.

The framework about the self-attention block. $F$ denotes the input feature and O denotes the output heatmap. The blue rectangles denote convolution layers with $1 \times 1 \times 1$ kernel size. The buff circulars denote cross product operations.

Firstly, we map the input feature map through three convolution layers with $1 \times 1 \times 1$ kernel size, respectively. Then three new feature maps are generated. In addition, we reduce the number of channels of all feature maps to the same size through the convolution layer. Secondly, we perform a cross product operation on two new feature maps and generate their correlation matrix. Thirdly, we generate the heatmap through $s o f t max$ . Finally, we do the cross product operation between the heatmap and the rest feature map to complete the transformation of the input.

In our experiments, we compare the performance of two different $F$ and O combination strategies. These two strategies can be formulated as

\begin{array}{l} F' = (O + 1) \times F \\ F' = O \times F \end{array}

Then the results of both combination strategies will be shown in the “Experiments” section. We will also show a comparison of self-attention and without self-attention frameworks. We can see that heatmaps generated by the self-attention blocks become more concentrated. Furthermore, in the experiment, we also observed that the application of self-attention can slightly accelerate the convergence of the model in training.

Positioning

After we get the attention heatmaps, we locate the region by the proposed method called multistage pooling-based positioning (MSPP). The pseudocode of MSPP is shown in Algorithm 1, which aims to explain the mechanism of the part positioning.

The multi-attribute corresponding heatmaps are expressed as H_m and H_v , respectively. Both heatmaps are inputs of the MSPP. H_m corresponds to the Model branch, and H_v corresponds to the ID branch. For the two branches, their heatmaps are generated by combining their respective self-attention maps with $F$ , such as equation (5). Although the heatmaps obtained by the two branches are very similar, the corresponding region in the heatmap obtained by the ID branch will be more centralized and sparse. Therefore, we first calculate the position of the bumper and use a multi-scale strategy in the windshield region detection. We use fixed kernel size pooling layers because the data in the PKU_VD are all front-view images and have similar vehicle-to-background ratio.

Algorithm 1.

The framework of part positioning.

The part positioning block can implement region detection without handcraft bounding boxes. Moreover, the obtained regions have attributes preference.

Experiments

Data set and setup

We do our training and evaluation on PKU_VD data set. In PKU_VD, there are two subsets, VD1 and VD2. Both subsets are huge data sets and they have the same data structure. The training set of VD1 has 70,591 vehicle IDs and 1232 Models. The training set of VD2 has 39,619 vehicle IDs and 1112 Models. Moreover, the training set of VD1 has 422,326 images and VD2 has 342,608 images. Since VD1 has more images and classifications, it has more advantages in generating multi-attribute private functions. Therefore, we chose VD1 as the experimental data set. Furthermore, we strip out some IDs which have less than 12 samples. But that will not weaken the performance. On the contrary, it improves training efficiency. Finally, 160,000 images remain as our training set, which contains 8633 IDs and 863 models.

In the training stage, we employ ImageNet pretrained weights as our initial parameters to fine-tune our model. We set our batch size as 160. There are 16 randomly picked ID classes in one batch and each ID has 10 randomly picked images. In addition, we resize all images to 224 × 224 × 3 as inputs. And we normalize and flip inputs during data initialization in order to optimize the training and enhance the data. Furthermore, we manually set the learning rate. To ensure that the objective function can jump out the previous optimal point, we increase the learning rate follows equation (6) in the first 20 epochs. Then it drops to one-tenth every 60 epochs until the model converges. We set the margin of equation (2) to 0.3 and set ω of equation (4) to 1

l r = 1 e^{- 4} \times (epoch + 1) / 2

Besides, we randomly pick 5000 images from the testing set, and draw the handcraft bounding boxes, and then employ IoU as a quantitative evaluation function. Because we use fixed kernel size pooling layers in the positioning, we also draw fixed size handcraft bounding boxes to ensure fairness. It is reasonable because all images in the data set have a similar vehicle-to-background ratio. In the rest of this section, we will show the ablation experiments of self-attention block and instantiate the results of our positioning. Furthermore, we also compare the performance of our method with different state-of-the-art CNN network backbones. Additionally, we also compare our ResNet-50-based method with the deep learning method YOLOv3. IoU is employed as a quantitative evaluation index and we also show the detection results in instances. Finally, we show some failure cases and give the analysis.

Evaluation criteria

We employ IoU as the quantitative evaluation index in our evaluation. It is widely used to measure the accuracy of a method in the object detection task. The calculation of IoU is related to two sets of bounding boxes, the ground-truth bounding boxes and the predicted bounding boxes from the trained model. So we randomly pick 5000 images from the testing set of PKU_VD and draw their handcraft ground-truth bounding boxes to calculate IoU. We can formulate IoU as the following equation

I o U = \frac{p \cap g t}{p \cup g t}

where p denotes the predicted bounding box and gt denotes the ground-truth bounding box. In other words, IoU is a ratio of the overlapping area of p and gt to the total area.

In our experiments, each sample needs to detect two local regions, the bumper region and the windshield region. We calculate IoU for them and expressed as IoU_m and IoU_v , respectively. Then the overall IoU _all can be formulated as follows

I o U_{all} = I o U_{m} + λ \times I o U_{v}

where λ denotes the weight and set to 1. Finally, IoU _all is used as the evaluation function in our ablation and comparison experiments in the next two sections to convenience for quantitative comparison.

Ablation analysis

As we mentioned above, the dual-branch architecture can help the features of different attributes become more independent and more consistent, the self-attention block can let the generated heatmaps become concentrating and also can accelerate the training convergence speed. So we show the performance comparison of different architectures and show the influence of the self-attention block in this section. In addition, we compare the performance of two different self-attention combining strategies that are formulated in equation (5). Moreover, we embed the self-attention block into a suitable place of the framework in order to maintain the number of the parameters in a reasonable range. If the insertion position is too shallow, the self-attention block cannot act on multiple attributes due to our dual-branch architecture. Conversely, if the insertion position is too deep, the number of parameters will increase significantly due to the increase of feature dimensions. Through experiments, we also discover that different insertion positions of the self-attention block do not significantly influence results.

It should be noted that in ablation comparing, we employ ResNet-50 as our backbone. In dual-wire architecture, we embed the self-attention block and separate the framework after Cov4_x layers, and choose the outputs of the self-attention blocks from both branches as our comparative heatmaps. For fairness, in single-wire architecture, we also choose the feature map after Cov4_x as our comparative heatmap, as shown in Figure 5. Besides, we also calculate IoU_all values of those different networks for quantitative comparison and the results are shown in Table 1. We compare four different networks to prove the effectiveness of our dual-wire and self-attention block. As can be seen from the table, dual-wire architecture brings surprising improvement, and self-attention brings more concentrated heatmaps especially on the heatmap generated by the Model branch. From Figure 5, we can see that the heatmaps of the dual-wire O are more sparse, which may cause the obtained bounding box to deviate easily. On the other side, the heatmaps of dual-wire O + 1 are denser in the detected region, thus the bounding boxes will be more accurate and robust. Furthermore, the heatmap from the ID branch has higher heat values in the personalized local region at the windshield even both heatmaps are similar. Finally, we can make a conclusion from our results, the dual-wire O + 1 strategy is more satisfactory and it has the best performance. So we compare the performances of different backbones under dual-wire O + 1 strategy in the next section.

Figure 5.

The results of four different architectures. Single-wire architecture has one line, dual-wire architecture has two lines which have partly sharing parameter layers, dual-wire O architecture is embedded self-attention follows the second combination in equation (5), and dual-wire O + 1 is embedded self-attention follows the first combination in equation (5). The image on the far left is the input with ground-truth bounding boxes. The smaller images in the top row on the right are heatmaps from the ID branch, and the bottom row images are the heatmaps from the Model branch. The yellow rectangle is the region of the windshield which is obtained by our architecture, and the red rectangle is the region of the bumper.

Table 1.

Results of different architectures (IoU _all).

Method	IoU_m	IoU_v	IoU _all
Single	0.089	0.286	0.187
Dual	0.594	0.783	0.683
Dual O	0.783	0.797	0.785
Dual O + 1	0.808	0.814	0.811

IoU: Intersection over Union. The bold-faced value is the maximum value of each column and means the method has the best performance.

Comparison

In this section, we compare the performance of our method with different CNN backbones. Three state-of-the-art CNNs are used in the comparison, namely VGG16,³⁰ GoogLeNet,³¹ and ResNet-50.³² Although they are all CNN frameworks, they differ in terms of structural details, number of parameters, performance of different tasks, and so on.

We insert self-attention block into a similar place in three compared architectures in order to be fair. We insert the self-attention block after the 10th weight layer in VGG16. And we insert the self-attention block after inception module 4e in GoogLeNet. In ResNet-50, we insert the self-attention block after cov4_x. Meanwhile, we also choose the output after self-attention block as our heatmap of every architecture. The local region detection results are shown in Figure 6. And we also quantify the comparison through IoU_all and shown as Table 2.

Figure 6.

The local region detection results of the proposed method under different CNN backbones. First row images are the samples of handcraft ground-truth bounding boxes. Rest rows show the results when ResNet-50, GoogLeNet, and VGG16 as the backbone successively. CNN: convolutional neural network.

Table 2.

Results of different backbones.

Method	IoU_m	IoU_v	IoU _all
VGG16	0.699	0.510	0.605
GoogLeNet	0.635	0.395	0.515
ResNet-50	0.808	0.814	0.811

IoU: Intersection over Union. The bold-faced value is the maximum value of each column and means the method has the best performance.

As shown in Table 2 and Figure 6, ResNet-50 as the backbone network can always achieve the best performance. Although the ResNet-50-based architecture has the best quantitative evaluation and best instance performance, the computational complexity is higher than that based on GoogLeNet. Furthermore, VGG16-based architecture has a high computational complexity and the performance is worse than the other two. Nevertheless, our method still has robust and satisfactory performance in instance experiments, as shown in Figure 6.

We can give the conclusion that those three CNNs have obtained satisfactory regions detection with our multi-attribute self-attention guided dual-wire architecture. It can be seen from the evaluation function results that ResNet-50 has better results.

We also compared our method with the popular object detection method YOLOv3. But our method does not need to label the handcraft annotation. It costs a lot of resources. Compared with that we prefer to have a pure image data-driven object detection. Additionally, the essence of the two is also different. YOLOv3 achieves object detection through the feature similarity. But our method achieves the object detection through the feature discrimination. This is also due to the use of different type of labels as the training basis. We also provide the comparison results which are shown in Table 3.

Table 3.

Results of YOLOv3.

Method	IoU_m	IoU_v	IoU _all
YOLOv3	0.803	0.837	0.820
ResNet-50	0.808	0.814	0.811

IoU: Intersection over Union. The bold-faced value is the maximum value of each column and means the method has the best performance.

Although our method does not rely on handcraft bounding boxes, the performance is close to YOLOv3 and even better on IoU_m than YOLOv3. In addition, YOLOv3 detects the bounding box by regression, so the number of detected bounding boxes is inconstant. As shown in Figure 7, both the upper left image and the lower left image do not detect the windshield region and bumper region, respectively. The image in the lower right directly lost all regions. An additional windshield region was detected in the upper right image, and it was wrong.

Figure 7.

The failure instances by YOLOv3.

Cases analysis

In this section, we show the results of some samples in Figure 8. It can be seen intuitively that the proposed architecture is robust in various environments. For example, our proposed architecture also has satisfactory local regions detection performance under point of view shift, occlusion, lighting changes, and so on. And our method is robust enough in those environments. The obtained bumper region can almost cover the areas which include valuable information like logo, license plate, and lights. And obtained windshield region can almost cover the areas of passengers, annual inspection marks, and decorations. Those regions precisely have the most important personal information for a vehicle and they can also be used in various vehicle tasks. Furthermore, from our experiments, it is obviously shown that our architecture is easily portable into various CNNs.

Figure 8.

The local region detection results of the proposed architecture when ResNet-50 as the backbone.

We also show some failure cases through Figure 9. Although our method is sufficiently robust in most environments, there are still failures in real and complex environments. The left-top sample in Figure 9 shows the case of region missing cause by the longitudinal tensile. So, image distortion is negative to performance. The right-top sample in Figure 9 shows the case of region missing cause by the camera translation. It is no doubt that a complete input benefits detection. Furthermore, the structure of the training data set may also cause the detected bounding boxes to shift, which is similar to the left-bottom sample in Figure 9. Although our method has been robust for point of view change, the excessive rotation can bring the information loss, which will cause the detected bounding box deviation shown as the right-bottom sample in Figure 9.

Figure 9.

The failure instances. There are four different types of failure instances shown in the above figure.

Conclusion

We propose a portable and multi-attribute self-attention guided dual-wire architecture for vehicle local region detection. Our method has achieved satisfactory results whether quantitative or qualitative indicators. We realize local region detection training without handcraft bounding boxes. It means that we can adaptively generate the different attributes corresponding region with the help of self-attention block and proposed MSPP. Although our method achieves satisfactory performance on reconstructed PKU_VD data set, it may fail on data sets that have inconsistent vehicle-to-background ratio. This is caused by a fixed scale pooling layer.

In future work, we will focus on how to use the adaptive kernel size pooling layer in our architecture to achieve the adaptive local region detection.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Key R&D Program of China under grant 2018YFB1305200, and the National Natural Science Foundation of China under grants U1509207, 61876167, and 61802348.

ORCID iD

Jingbo Chen

References

Psyllos

Anagnostopoulos

Kayafas

. Vehicle model recognition from frontal view image measurements. Comput Stand Interfaces 2011; 33(2): 142–151.

Fang

Zhou

, et al. Fine-grained vehicle model recognition using a coarse-to-fine convolutional neural network architecture. IEEE Trans Intell Transp Syst 2016; 18(7): 1782–1792.

Lee

Gwak

Jeon

. Vehicle model recognition in video. Int J Signal Process Image Process Pattern Recognit 2013; 6(2): 175.

Pearce

Pears

. Automatic make and model recognition from frontal images of cars. In: 2011 8th IEEE international conference on advanced video and signal based surveillance (AVSS), Klagenfurt, Austria, 30 August–2 September 2011, pp. 373–378. IEEE.

Zhao

, et al. Part-regularized near-duplicate vehicle re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Long Beach, CA, USA, 15–20 June 2019, pp. 3997–4005. IEEE.

Zhang

Cao

, et al. Part-guided attention learning for vehicle re-identification. arXiv preprint arXiv:190906023, 2019.

Zhao

Shen

Wang

, et al. Structural analysis of attributes for vehicle re-identification and retrieval. IEEE Trans Intell Transp Syst 2020; 21(2): 723–734.

Silva

Jung

. Real-time Brazilian license plate detection and recognition using deep convolutional neural networks. In: 2017 30th SIBGRAPI conference on graphics, patterns and images (SIBGRAPI), Niteroi, Brazil, 17–20 October 2017, pp. 55–62. IEEE.

Yang

Meng

, et al. Towards end-to-end license plate detection and recognition: a large dataset and baseline. In: Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018, pp. 261–277. Berlin: Springer.

10.

Artan

Bulan

Loce

, et al. Passenger compartment violation detection in HOV/HOT lanes. IEEE Trans Intell Transp Syst 2015; 17(2): 395–405.

11.

Yuan

Meng

Wei

. A method of location the vehicle windshield region for vehicle occupant detection system. In: 2012 IEEE 11th international conference on signal processing, Beijing, China, 21–25 October 2012, vol. 1, pp. 712–715. IEEE.

12.

Wang

Liu

Zhang

, et al. A spatio-temporal CRF for human interaction understanding. IEEE Trans Circuits Syst Video Technol 2016; 27(8): 1647–1660.

13.

Cai

Liu

Zhang

, et al. Visual focus of attention estimation using eye center localization. IEEE Syst J 2015; 11(3): 1320–1325.

14.

Zhang

Chen

Liu

, et al. Normalized weighted shape context and its application in feature-based matching. Opt Eng 2008; 47(9): 097201.

15.

Mountelos

Amanatiadis

Sirakoulis

, et al. Vehicle windshield detection by fast and compact encoder-decoder FCN architecture. In: 2019 8th international conference on modern circuits and systems technologies (MOCAST), Thessaloniki, Greece, 13–15 May 2019, pp. 1–4. IEEE.

16.

Redmon

Farhadi

. YOLOv3: an incremental improvement. arXiv preprint arXiv:180402767, 2018.

17.

Yan

Tian

Wang

, et al. Exploiting multi-grain ranking constraints for precisely searching visually-similar vehicles. In: Proceedings of the IEEE international conference on computer vision, Venice, Italy, 22–29 October 2017, pp. 562–570. IEEE.

18.

Liu

Mei

, et al. PROVID: progressive and multimodal vehicle reidentification for large-scale urban surveillance. IEEE Trans Multimedia 2017; 20(3): 645–658.

19.

Liu

Tian

Yang

, et al. Deep relative distance learning: tell the difference between similar vehicles. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 27–30 June 2016, pp. 2167–2175. IEEE.

20.

Zhou

Shao

. Cross-view GAN based vehicle generation for re-identification. In: Proceedings of the British Machine Vision Conference (BMVC), London, UK, 4–7 September 2017, pp. 186.1–186.12. BMVA Press.

21.

Chu

Sun

, et al. Vehicle re-identification with viewpoint-aware metric learning. In: Proceedings of the IEEE international conference on computer vision, Seoul, South Korea, 27 October–2 November 2019, pp. 8282–8291. IEEE.

22.

Ding

Lin

Wang

, et al. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognit 2015; 48(10): 2993–3003.

23.

Schroff

Kalenichenko

Philbin

. FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA, USA, 7–12 June 2015, pp. 815–823. IEEE.

24.

Szegedy

Vanhoucke

Ioffe

, et al. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 27–30 June 2016, pp. 2818–2826. IEEE.

25.

Hermans

Beyer

Leibe

. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:170307737, 2017.

26.

Wang

Girshick

Gupta

, et al. Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA, 18–23 June 2018, pp. 7794–7803. IEEE.

27.

Liao

Yang

, et al. Video-based person re-identification via 3D convolutional networks and non-local attention. In: Asian conference on computer vision, Perth, Australia, 2–6 December 2018, pp. 620–634. Berlin: Springer.

28.

Yuan

, et al. Interaction-aware spatio-temporal pyramid attention networks for action classification. In: Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018, pp. 373–389. Berlin: Springer.

29.

Vaswani

Shazeer

Parmar

, et al. Attention is all you need. In: Advances in neural information processing systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017, pp. 5998–6008.

30.

Simonyan

Zisserman

. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:14091556, 2014.

31.

Szegedy

Liu

Jia

, et al. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA, USA, 7–12 June 2015, pp. 1–9. IEEE.

32.

Zhang

Ren

, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 27–30 June 2016, pp. 770–778. IEEE.

Multi-attribute self-attention guided vehicle local region detection based on convolutional neural network architecture

Abstract

Keywords

Introduction

Proposed method

Dual-branch architecture

Loss

Self-attention block

Positioning

Experiments

Data set and setup

Evaluation criteria

Ablation analysis

Comparison

Cases analysis

Conclusion

Footnotes

Declaration of conflicting interests

Funding

ORCID iD

References