Visual tracking with multilevel feature,similarity attention,color constraint,and global redetection

Abstract

Visual tracking is fundamental in computer vision tasks. The Siamese-based trackers have shown surprising effectiveness in recent years. However, two points have been neglected: firstly, few of them focus on fusing the image level and semantic level features in neural networks, which usually resulting in tracking failure when differentiating the target from other distractors of the same class. Secondly, the robustness of the previous redetection scheme is limited by simply expanding the search region. To address these two issues, we propose a novel multilevel feature-weighted Siamese region proposal network tracker, which employs a feature fusion module to construct discriminative feature embedding and a similarity-based attention module to suppress the distractors in the search region. Furthermore, a color-based constraint module is presented to further suppress the distractors with the same class to the target. Finally, a well-designed global redetection scheme is built to handle long-term tracking tasks. The proposed tracker achieves state-of-art performance on a series of popular benchmarks, including object tracking benchmark 2013 (0.699 in success score), object tracking benchmark 2015 (0.700 in success score), visual object tracking 2017 (0.470 in expected average overlap score), and visual object tracking (0.485 in expected average overlap score).

Keywords

Visual tracking siameseRPN color-based constraint module

Introduction

Visual tracking is a technique to track the target in an image sequence given the target’s bounding box in the first frame as the template. With the progress of computer vision, this task has received close attention in recent years and is widely used in intelligent applications, such as robotics, autonomous car, and video surveillance. In the development of visual tracking, many excellent works have been proposed. Most of them have addressed several challenging problems such as vague target, rotation, deformation, and illumination variation. Furthermore, other works in long-term (LT) visual tracking focus on overcoming the target disappearance in a video.

Recently, due to the trade-off between accuracy and speed, trackers based on the Siamese network^{1

–15} are considered as the mainstream methods. Siamese-based trackers were proposed. They extract features of the template and a search frame by two convolution structures with shared weights and generate a similarity map between the template and this frame. What’s more, some improved Siamese-based trackers can be used to predict target deformation^{5
–7,9,16

–24} or even directly output target segmentation,¹⁵ which have made significant achievements. However, two key issues remain. Firstly, the existing Siamese-based methods perform poorly in distinguishing the target and the distractors with the same class of the target. The reason is that conventional Siamese-based trackers utilize the backbone of the classification task. Thus, the high-level features fail to represent the difference between the target and other objects with the same class. Based on these features, the obtained similarity map is also unreliable to distinguish the target and others. As shown in Figure 1, the previous method has high responses on the distractors, that is, the referee, athletes marked in red, where the response indicates the similarity between these candidates and the template. Secondly, the previous redetection scheme simply expands the search region, thus limiting the performance of the LT tracking.

Figure 1.

Visualization of the similarity of Siamese-based tracker. (a) The template, (b) the search region, target, and distractors are marked in green and red boxes, respectively, and (c) the similarity maps of different Siamese-based trackers. Yellow and blue indicate high and low similarity.

In this article, we argue that since the high-level cue of the target and a same-class object are similar, the low-level cue plays a key role in this case. In contrast, high-level cues are essential to distinguish the target from other similar appearance objects. Thus, the features of different levels are useful in different conditions. For this reason, we propose a multilevel feature-weighted Siamese region proposal network (MFW-SiamRPN). In contrast to the previous works, a feature fusion module (FFM) is designed to fuse all level features to a unified representation, which is used to encode the similarity between the template and each sliding window in the search region. Then, we design a similarity attention module. It utilizes channel attention to suppress channels that fail to distinguish template and distractors and utilizes spatial attention to suppress the similarities between template and distractors. Besides, to make full use of the color information lost in the deep network, a color-based constraint module (CCM) is proposed to suppress the network’s output. Finally, to track a target in LT scenarios, a global redetection scheme is proposed to detect the occurrence that the target is out of the view and redetect the target, which is simple but efficient.

In general, the main contributions can be summarized as follows:

To distinguish the target and other distractors, we propose an MFW-SiamRPN (MFW-SiamRPN). It utilizes multilevel cues to template and search region and suppresses other distractors by a similarity attention module.

A CCM, which takes full use of color information, is designed to constraint the network output.

A well-designed global redetection scheme is proposed, which ensures the robustness of the proposed tracker and outperforms other Siamese-based trackers in LT tracking.

Our method outperforms the existing state-of-the-art methods on the fashionable benchmarks: object tracking benchmark (OTB)-2013,²⁵ OTB-2015,²⁶ visual object tracking (VOT)-2017,²⁷ VOT2018,²⁸ and VOT-2018-LT.²⁹ Moreover, abundant ablation studies verify our point of view and the effectiveness of each module.

Related work

Review of Siamese-based trackers

Recently, trackers based on the Siamese network draw great attention for their remarkable performance in both accuracy and efficiency. These methods employ two identical networks initialized in the classification task (e.g. AlexNet,³⁰ Visual Geometry Group (VGG),³¹ ResNet,²⁸ and MobileNet³²) to extract features from the template and search region and locate the target by matching the template. Generic Object Tracking Using Regression Networks (GOTURN)³³ directly concatenates the features of the template and search region and employs several fully connected layers to locate the target. Siamese Instance Search for Tracking (SINT)³⁴ constructs the vectors of the target and several candidates by region-of-interest pooling and then outputs the candidate most similar to the template. Fully convolutional (FC) Siamese networks (SiamFC)¹ proposes a cross-correlation layer to obtain the similarity between the target and the search region, improving accuracy and efficiency. Based on SiamFC, RasNet³ proposes a residual attention module, which reduces the noise in the deep features. As shown in Figure 2(a) and (b), SiamRPN^5,24 and SiamRPN++⁷ use a well-designed region proposal network (RPN)³⁵ to predict the position and deformation of the target simultaneously and outperform other Siamese-based trackers. However, since the backbones used in these methods are all pretrained in the classification task, the inherent difference between the two tasks, namely, classification and tracking, making the deep feature fail to represent the difference between the target and other objects with the same class. We propose MFW-SiamRPN and CCM to overcome this issue. As shown in Figure 2, different from others, all level features are fused to represent the template and the search region.

Figure 2.

Structure comparison. (a) SiamRPN⁵ exploits the highest-level feature for similarity calculation. (b) Siam RPN++⁷ employs the last three-level features of encoder to construct the similarity. (c) Our network takes both the low-level image information and the high-level semantic information into account by a fusion module and further designs the attention of similarity to improve the accuracy. RPN: region proposal network.

LT tracking and redetection scheme

In visual tracking, the target is usually occluded or moved out of the view. To address the issue, the LT tracking redetects the target when the target returns to the view of the camera. There are mainly two types of methods for LT tracking. Firstly, some methods redetect the target by expanding the search region in the frame. DaSiamRPN⁶ and SiamRPN++⁷ establish the reliable confidence score by specially designed training samples but thoughtlessly enlarge the search region while losing the target, which limits the performance of the LT tracking. Secondly, a few methods construct a redetection scheme. Skimming-Perusal Long-term Tracking (SPLT)¹⁴ builds a global redetection scheme by training an additional network called skim-module and surpasses other methods that enlarge the search region. However, the additional skimming model requires feature maps of the whole search region and downsamples it to a vector, which brings high Graphics Processing Unit (GPU) cost and risk of low accuracy.

The redetection task could be divided into two steps, object detection and target identification. We find several simple but effective proposals to locate an object,^36

–40 and the excellent Siamese-based trackers could handle the target identification.¹⁸ In the section “Redetection scheme for LT tracking,” we apply object proposal to locate candidate objects and combine our CCM with the Siamese network to identify the target.

Method

Our method is described in the following order. In the section “Multilevel feature-weighted SiameseRPN,” considering that the low-level feature plays a vital role in distinguishing the target and other objects, the MFW-SiamRPN is presented, which weights features of all levels and then suppresses the similarity between the false positives and the target. In the section “Color-based constraint module,” to take color information into account, a CCM is presented to constrain the network’s output. In the section “Redetection scheme for LT tracking,” the redetection scheme for LT tracking is proposed to address the issue that the target disappears or is fully occluded.

Multilevel feature-weighted SiameseRPN

Overview

We adopt SiamRPN⁵ as the basic tracker. Figure 3 shows the structure of our network. Given the first frame I ₁ and the current frame I_n , the whole network takes a template $z \in ℝ^{3 \times}^{2 w_{T} \times 2 h_{T}}$ from frame I ₁ and a search region x from frame I_n as inputs, where w_T and h_T are the width and height of the target. And the template is centered on the target and has red, green, and blue (RGB) channels and a resolution of 2w_T × h_T . Our network aims to locate the target in the search region x. In the following paragraphs, we first introduce the entire network and then represent the proposed modules in detail.

Figure 3.

The pipeline of our network. The ResNet-50 is employed to extract the feature of the template region and the search region. Then an FFM is used to aggregate features of all levels to a unified feature map that is further used to calculate the similarity between template region and search region. Subsequently, two attention modules are introduced to improve the similarity map. After all, the predictor locates the target in the search region. FFM: feature fusion module.

Mathematically, let F ?denotes fusing the feature of the backbone (see the section “Multilevel feature-weighted SiameseRPN”), the similarity between template z and search region ?can be formulated as follows

f (z, x) = F (ϕ (z)) * F (ϕ (x))

where ∗ denotes the depth-wise cross-correlation operation,⁷ ϕ?represents the feature extraction of the backbone, that is, a ResNet-50 pretrained on ImageNet.⁴¹ The channel number of the similarity map $f (z, x) \in ℝ^{C \times}^{H_{s} \times W_{s}}$ is 512, the height is 17, and the width is 17, namely C = 512, H_s = 17, and W _s = 17. Then, the attention module based on similarity is designed to suppress the similarities between distractors and the template (see the section “Multilevel feature-weighted SiameseRPN”). Let A _s() denotes the spatial attention and A _c() denotes the channel attention. The weighted similarity can be formulated as follows

f (z, x) = A_{s} (A_{c} (f (z, x)) \otimes f (z, x)) \otimes (A_{c} (f (z, x)) \otimes f (z, x))

where $\otimes$ denotes the element-wise multiplication. Finally, following the same way in ref,⁵ we generate many anchor boxes in search region x as the candidates of the target. And the similarity f(z,x) is utilized to predict the probability that each box is the target by using two convolution branches, namely Classification (CLS) branch and Regression (REG) branch. The CLS branch predicts the similarity between each anchor boxes and the target (denoted by N_A score in the section “Color-based constraint module”). The REG branch predicts the deformation and displacement [d_x , d_y , d_w , d_h ] of each anchor box. Mathematically, the two branches can be formulated as follows

\begin{array}{l} CLS(z, x) = C_{1} (f (z, x)) \\ REG(z, x) = C_{1} (f (z, x)) \end{array}

where C ₁ denotes a 1 × 1 convolution operation, which adjusts the channels of f(z,x) to 2 (for CLS) and 4 (for REG). By using a SoftMax operation, CLS(z,x) is transformed into a similarity map of template and search region in the corresponding positions.

During training, we use the SoftMax loss function to supervise the CLS branch, while the REG branch is optimized by Smooth L1 loss function. The total training loss is the summation of them. Formally, for the CLS branch in our network, the prediction is denoted as $G \in ℝ^{2 \times w_{s} \times H_{s}}$ , the label is denoted as $\overset{⌢}{G} \in ℝ^{1 \times w_{s} \times H_{s}}$ , their resolutions are same with similarity f(z,x). The SoftMax operation is utilized to calculate the probability of a pixel belong to the target: $G (i, j) = \frac{G (1, i, j)}{G (0, i, j) + G (1, i, j)}$ , and the SoftMax loss can be formulated as follows

l_{softmax} = - \sum_{j = 1}^{W_{s}} \sum_{i = 1}^{H_{s}} \overset{⌢}{G} (i, j) \times log (G (i, j))

For the REG branch in our network, the prediction is denoted as $L \in ℝ^{4 \times w_{s} \times H_{s}}$ , each pixel x of L response is [d_x , d_y , d_w , d_h ] and the corresponding label is $\overset{⌢}{y}$ , the smooth L1 loss value at each corresponding position can be calculated as follows

l_{smooth} = {\begin{cases} 0.5 \times (x - \overset{⌢}{y})^{2}, | x - \overset{⌢}{y} | < 1 \\ | x - \overset{⌢}{y} |, | x - \overset{⌢}{y} | \geq 1 \end{cases}

Feature fusion module

Based on the motivation: both the low-level feature and the high-level feature are crucial to distinguish the target and others, a FFM is proposed to capture the feature from all levels. In detail, the stage-n side-outputs of the backbone are represented as Res n. The features of the first two stages, namely Res 1 and Res 2, are down-sampled to one-eighth size of input images by two parallel pooling layers, namely average pooling and max pooling. The average pooling is to capture the relation between channels, and the max-pooling is to gather another important cue about distinctive object features.⁴² Then, these down-sampled features are concatenated with Res 3, Res 4, and Res 5. Finally, a bottle-neck structure, namely a sequence of two 1 × 1 convolutions and a 3 × 3 convolution, is utilized to nonlinearly compress all the features and reduce the channel number to 512. Formally, the FFM can be formulated as follows

\begin{array}{l} f = Cat(MP(Res1), AP(Res1), MP(Res2),
AP(Res2)Res3,Res4,Res5) \\ F () = relu(C_{1} (relu(C_{3} (relu(C_{1} (f)))))) \end{array}

where MP()denotes the max pooling operation, AP()denotes the average pooling operation, Cat() denotes the concatenation operation, C ₃() denotes the 3 × 3 convolution operation, relu() denotes the Rectified Linear Unit (ReLU) activation.

Since the low-level features contain little semantic information, such features could be used to distinguish two objects of the same class. In contrast, high-level features could be used to separate objects with different classes. Hence, multilevel features are all necessary to distinguish the target and other objects, and the fused features are better to represent all the objects than those in the previous works.

Attention based on similarity

As shown in Figure 3, the similarity $f (z, x) \in ℝ^{C \times H_{s} \times W_{s}}$ is obtained by the depth-wise cross-correlation⁵ of the two fused features. To further enhance the difference between the template and other distractors, we capture the attention of the similarity. Specifically, two types of attention information are captured sequentially, namely channel attention and spatial attention. These two submodules are motivated by Convolutional Block Attention Module (CBAM).⁴² Firstly, the channel attention module is utilized to capture the interchannel relationship of similarity f(z,x). As shown in Figure 4, it uses the average pooling operation to aggregate the spatial information and uses the max pooling operation to capture the discriminative cue in similarity matching. Then, a shared convolution block, which is constructed by three 1 × 1 convolutions, is utilized to fuse both pooled features. Finally, a sigmoid function is used to generate channel weights $A_{c} (f (z, x)) \in ℝ^{C \times 1 \times 1}$ . Formally, the channel weights can be formulated as follows

Figure 4.

The detail of our channel attention (left), spatial attention (middle), and similarity maps after attention (right).

A_{c} (X) = Sig (C_{1} (relu(C_{1} (relu(C_{1} (MP(X))))))+ (C_{1} (relu(C_{1} (relu(C_{1} (AP(X))))))

where Sig() denotes the sigmoid function, and X denotes the input feature. The spatial attention module A_s () is utilized to capture the interspatial relationship of similarity. Specifically, the feature map is firstly passed through a channel-wise max-pooling layer and a channel-wise average pooling layer to be down-sampled to two feature maps with one channel, respectively. For the input feature X with C channels, the channel-wise max pooling and average pooling can be formulated as follows

\begin{array}{l} X_{c_avgpool} (i, j) = mean(X (0: C, i, j)) \\ X_{c_maxpool} (i, j) = max(X (0: C, i, j)) \end{array}

Then, these two one-channel feature maps are concatenated and passed through a 3 × 3 convolution and a sigmoid to generate the spatial attention

A_{s} (X) = Sig (C_{3} (Cat(X_{c_avgpool}, X_{c_maxpool})))

Finally, the output feature map can be formulated as follows

f (z, x) = A_{s} (A_{c} (f (z, x)) \otimes f (z, x))

where the bolded $f (z, x) \in ℝ^{C \times H_{s} \times W_{s}}$ denotes the feature after attention.

For the channel attention, since each channel’s discriminative ability in f(z,x) is different, the channel attention gives high weights to the channels of similarity f(z,x) with intense discrimination. Thus, compared to directly applying the original similarity to match the target, the weighted similarity better distinguishes the target and other distractors. For the spatial attention, since the pooled features contain each pixel’s discriminative channel, the spatial attention module weights all pixels, which enhances the response of the target and suppresses that of the distractors. Thus, our similarity attention enhances the difference between target and distractors in feature representation.

Color-based constraint module

Due to the normalization operation in Convolutional Neural Network (CNN), normalized color in different search regions and templates ranges variously, resulting in limited color information expression. To make full use of the color information, we propose the CCM, which takes the boxes predicted by the network as inputs and computes the color-based similarities between these boxes and the target. In detail, CCM contains three steps.

Firstly, assuming that the template z consists of two disjoint regions: the rectangle region of the target with size w_T ×h_T (i.e. T) and the rest of the template z is denoted as background area (i.e. B), this module constructs the color histograms of the two regions, respectively. Specifically, we present one pixel in the template with RGB channels by a triplet $i \in ℝ^{3}$ . Then, the RGB values of all pixels are reduced by 16 times: $i = [i / 16] \in ℝ^{3}$ . Finally, since each value of ?is an integer from 0 to 15, we count the number of occurrences of each triplet in region T to construct the normalized color histogram $H_{T} \in ℝ^{16 \times 16 \times 16}$ . In the same way, the histogram $H_{B} \in ℝ^{16 \times 16 \times 16}$ is constructed corresponding to region B.

Secondly, according to the color histograms H_T and H_B , the probability that a pixel i in search region x belongs to target is obtained by the Bayes rule (see additional materials for detailed formula)

P (i \in ℑ | T, B, i) = \frac{H_{T} ([i / 16])}{H_{T} ([i / 16]) + H_{B} ([i / 16])}

where $ℑ$ denotes the class of the target. Based on Eq. (11), the probabilities of all pixels in x are obtained to form a probability map S, as shown in Figure 5(c).

Figure 5.

The area with blue in (a) is the region of the target (T), while the rest surrounding area is background (B), (b) is a similarity map generated by MFW-SiamRPN, (c) is the target probability map of x, and (d) is the search region x, the distractors with high N_A scores are in red bounding boxes and the selected target is in green bounding boxes. MFW: multilevel feature-weighted; RPN: region proposal network.

Thirdly, we rerank the boxes given by the MFW-SiamRPN. Specifically, assuming that A = (x_A , y_A , h_A , w_A ) denotes a bounding box with the same representation of other articles. Thus, the target probability of box A is stated as the mean probability of all pixels inside A

{TS}_{A} = \frac{1}{h_{A} \times w_{A}} \sum_{i \in A} S (i)

where S(i) denotes the value of pixel i on the probability map S.

Then, we combine this color-based probability TS_A with the score given by the network

{SC}_{A} = (N_{A} \times (1 - α) + {TS}_{A} \times α) \times exp (1 - D_{x y} / 255) \times P_{A}

where N_A is the similarity score predicted by the CLS branch of our MFW-SiamRPN, D_xy is the distance between box A and the center of region x, P_A is the deformation penalty of the bounding box A, which is referred to ref.⁷ α is a constant to balance the Siamese network and the CCM. By utilizing Eq. (13), a more reliable target prediction is obtained than utilizing the network only, as shown in Figure 5(d).

Redetection scheme for LT tracking

To extend our network for LT tracking, we propose a redetection scheme, as shown in Figure 6. Since the score of the target SC given by CCM is reliable, the redetection scheme is activated if SC<0.7×SC₂, where SC ₂ denotes the target score in the second frame of the input video. The redetection scheme is composed of three steps: probability map-generating, object proposal, and target selection. Initially, the whole image is taken as the input, and a target probability map of this image is generated by Eq. (13) in CCM. Secondly, we employ the object proposal method, i.e. Edge boxes,³⁹ to obtain various bounding boxes with the same size as the target in the template or in the last frame. For clarity, we denote the bounding box as A, and its objectness score predicted by Edge boxes as OS (A). Finally, the boxes with the top 200 objectness scores are retained to the third step. Thirdly, all the bounding boxes {A_i |1≤i≤200} are reranked by the color-based probability. In more detail, by utilizing the target probability map given by the first step and utilizing Eq. (13), the target probability TS _Ai is obtained. Then, the score for LT tracking SL _Ai is formulated to rerank all the boxes

Figure 6.

The pipeline of our redetection scheme. we firstly employ edge boxes³⁹ to obtain the candidates with high objectness³⁹ and use TS _A score to rerank these candidates. Five search regions centered on the top five candidates are inputted to our network to redetect the target.

Figure 7.

Comparisons of our MFW-SiamRPN tracker with state-of-the-art Siamese-based trackers in challenging scenes of OTB2015 benchmark. Our tracker shows more robustness on scenarios with similar distractors, occlusion, flurries, illumination variation, and so on. MFW: multilevel feature-weighted; OTB: object tracking benchmark; RPN: region proposal network.

{SL}_{A} = ({OS}_{A} \times (1 - β) + {TS}_{A} \times β) \times exp (1 - \frac{D_{x y}}{\sqrt{w_{I} \times h_{I}}})

where OS _A is the objectness score of a bounding box A, TS _A is the CCM score in the section “Color-based constraint module,” β is a constant to balance the object proposal Module and the CCM. D_xy is the distance between the bounding box and the center of an image, w_I and h_I are the width and height of an image, respectively. By top selection of the Sl _A score, the top five candidate bounding boxes are left. These five candidates have a high probability of including objects and are similar to the target in color. Finally, five search regions centered on these candidates are sent into our Siamese network to predict the target’s location and size.

The whole redetection scheme in LT tasks could be seen in Figure 6. When the target disappears from the search region, it will obtain a low SC score and the tracker activates the global redetection scheme. The object proposal method helps our tracker locate candidate objects and the CCM helps our tracker filter out objects that are not similar to the target in color. Consequently, the Siamese network predicts the bounding box of the target precisely.

Experiments and evaluations

Implementation details

The training pairs of our Siamese network come from three data sets. Large-scale video detection data set (ImageNet Large Scale Visual Recognition Challenge [ILSVRC] 2015⁴¹) contains 4417 videos of 30 different objects and over 1 million images of 1000 different objects. Microsoft Common Object in Context (COCO)⁴³ data set contains 328,000 images of 91 different objects. YouTube-BoundingBoxes⁴⁴ data set contains 210,000 videos of 13 different objects. All labeled objects in these images and videos are cropped with the target centered in it and double the size of $\sqrt{w_{T} \times h_{T}}$ . After the crop operation, these template patches are resized to 127 × 127 pixels. The search regions are cropped twice the size of template patches in the same ways.

The proposed Siamese network is trained for 20 epochs with a batch size of 20 and an initial learning rate of 10⁻ ² by using the stochastic gradient descent (SGD) solver. The momentum and weight decay are set to 0.1 and 0.5, respectively. Parameters of the backbone are frozen in the first 10 epochs and unlocked in the last 10 epochs to adapt to tracking tasks.

Evaluation on OTB data set

OTB2013¹⁶ consists of 50 labeled videos and brings challenging scenes such as rotation, blur, and occlusion. OTB2015²⁶ is an extension of OTB2013, including 100 labeled sequences, and contains more complicated scenes. The precision score and the success score are used to evaluate the performance.

The precision score represents the ratio of predicted results (the distance between the predicted center and ground-truth center is within 20 pixels). The success score shows the Area Under Curve (AUC) of the successful plot. The x-axis of the success plot indicates the IOU threshold between the predicted and ground-truth result (varies from 0 to 1), whereas the y-axis shows the proportion of successful frames, which satisfied the IOU threshold. One-pass evaluation is employed for the evaluation.

Our tracker is compared with several state-of-art Siamese-based trackers, including SiamFC,¹ parallel tracking and verifying (PTAV),⁴ RasNet,³ Correlation Filter based Network (CFNet),² SiamRPN,⁵ DaSiamRPN,⁶ SiamRPN++,⁷ SiamDW,⁸ Siamese Cascaded Region Proposal Network (C-RPN),⁹ Discriminative Model Prediction (DIMP),¹⁰ GradNet,¹¹ and Meta-Learning Tracker (MLT).¹² Table 1 presents the precision scores and success scores of Siamese-based trackers mentioned earlier. As presented in Table 1, our tracker outperforms these state-of-art Siamese-based trackers over two benchmarks. Specifically, our tracker achieves the best precision (0.930 on OTB2013 and 0.915 on OTB2015). For the success score, our tracker superiorly performs against other evaluated methods (0.699 on OTB2013 and 0.7 on OTB2015). Compared with the AlexNet-based methods, our tracker benefits from the Res50 backbone and outperforms DaSiamRPN by 3.5% on the precision of OTB2015. Note that our method only achieves a little improvement over SiamRPN++, which is due to a lot of gray-scale videos in OTB 2015, and no color cue can be utilized. Besides, compared with other methods, whose backbone are deep CNNs, our tracker benefits from the FFM, the attention module, and the CCM. The res50 backbone extracts features that help to distinguish the target from the background. Low-level cues utilized by the FFM help our tracker distinguish the target from distractors with the same class, while high-level cues help our tracker distinguish the target from distractors with a similar appearance. The CCM replenishes color information, which is lost in normalization operation. The performance of our tracker proves the effectiveness of these modules.

Table 1.

Precision and success results on OTB2013 (left) and OTB2015 (right). Red bold type indicates the best performance, and blue bold type indicates the second-best performance (the same below).

Methods	Backbones	OTB2013		OTB2015
Methods	Backbones	Precision	Success	Precision	Success
SiamFC	AlexNet	0.809	0.608	0.771	0.582
PTAV	AlexNet	0.895	0.663	0.859	0.635
RasNet	AlexNet	0.892	0.670	0.693	0.642
CFNet	AlexNet	0.746	0.611	—	0.568
SiamRPN	AlexNet	−	−	0.851	0.637
DaSiamRPN	AlexNet	−	−	0.880	—
SiamRPN++	Res50	−	−	0.914	0.696
SiamDW	Res50	0.920	0.670	0.900	0.670
C-RPN	Res50	−	0.675	−	0.663
DIMP	AlexNet	−	−	—	0.684
GradNet	AlexNet	−	−	0.861	0.639
MLT	AlexNet	−	0.621	−	0.611
Ours	Res50	0.930	0.699	0.915	0.700

FC: fully convolutional; OTB: object tracking benchmark; RPN: region proposal network; PTAV: parallel tracking and verifying.Evaluation on VOT data set

VOT2017²⁷ and VOT2018²⁸ consist of 60 sequences and show more difficult scenes than OTB benchmarks. We use the most-watched expected average overlap (EAO) score on the baseline benchmark as the evaluation standard. EAO is an estimator of the average overlap a tracker is expected to attain on a large collection of short-term sequences. Besides, we also choose the accuracy (average overlap when tracking successfully) and the Robustness (failure times) as evaluation standards.

As shown in Table 2, our tracker is compared with the state-of-art tracking algorithms, including SiamFC, SiamRPN, DaSiamRPN, SiamRPN++, SiamDW, C-RPN, GradNet, DIMP, UpdateNet.¹³ Our approach shows the best EAO score (0.470 on VOT2017 and 0.485 on VOT2018). SiamRPN++ and DIMP have the same backbone as our tracker. These three Res50 based trackers have similar accuracy scores. Moreover, DIMP shows less tracking failures (0.153) followed by our tracker (0.201). As to the most concerned EAO score on VOT2018, our tracker outstands the DIMP by 4.5% and SiamRPN++ by 7.1%, respectively, which sufficiently proves the effectiveness of the proposed method.

Table 2.

Evaluation results on VOT2017 (left) and VOT2018 (right).

Trackers	Backbones	VOT2017			VOT2018
Trackers	Backbones	A (↑)	R (↓)	EAO (↑)	A (↑)	R (↓)	EAO (↑)
SiamFC	AlexNet	0.502	0.585	0.188	−	−	−
SiamRPN	AlexNet	0.490	0.460	0.244	0.585	0.276	0.383
DaSiamRPN	AlexNet	−	−	−	0.560	0.340	0.326
SiamRPN++	Res50	−	−	−	0.600	0.234	0.414
SiamDW	Res50	0.520	0.410	0.301	−	−	−
C-RPN	Res50	−	−	0.289	−	−	−
GradNet	AlexNet	0.507	0.375	0.247	−	−	−
DIMP	AlexNet	−	−	−	0.597	0.153	0.440
UpdateNet	AlexNet	−	−	−	−	−	0.393
Ours	Res50	0.620	0.210	0.470	0.601	0.201	0.485

FC: fully convolutional; VOT: visual object tracking; A: accuracy; EAO: expected average overlap (both larger are better); R: robustness score (smaller is better); RPN: region proposal network.

Evaluation on the VOT2018-LT data set

We choose the VOT2018-LT²⁹ data set with a lot of completely lost scenes to show the superiority of our redetection scheme. This data set includes totally 146,847 frames with 35 sequences of various objects. Each sequence contains at least 1000 images and average 12 LT object disappearances. The main evaluation protocol of the VOT2018-LT data set is F1 score which is defended by

F 1 (θ) = \frac{2 pr (θ) \times Re (θ)}{pr (θ) + Re (θ)}

where θ is a given threshold. Pr(θ), Re (θ) are the precision, recall, and F1 score under the corresponding threshold, respectively. The threshold varies from 0 to 1 to find the maximum F1 score.

Accordingly, we choose some trackers specifically optimized for LT tracking to compare their performance, including, SiamFC, DaSiamRPNLT, SiamRPN++LT, PTAVplus,⁴ and SPLT.¹⁴ The evaluation result is placed in Table 3.

Table 3.

Evaluation results on VOT2018-LT (for all metrics, larger is better).

Tracker	F1	Precision	Recall
SiamFC	0.433	0.636	0.328
DaSiamRPN-LT	0.607	0.627	0.588
PTAVplus	0.481	0.595	0.404
SPLT	0.616	0.633	0.600
SiamRPN++LT	0.629	−	−
Ours	0.641	0.633	0.625

FC: fully convolutional; VOT: visual object tracking; RPN: region proposal network; LT: long-term; PTAV: parallel tracking and verifying.

Compared with none redetection trackers, our tracker has advantages in the F1 score for these two trackers only enlarge the search region without an appropriate redetection scheme. Based on its outstanding network backbone and redetection scheme, even our tracker still has advantages over SPLT that build a redetection scheme from the F1 score (an improvement of 2.5%).

Ablation studies

In this section, we firstly present the positive function of the MFW-SiamRPN and CCM in short-term tracking tasks. Then, the analysis of LT tasks is brought out to show the contribution of the redetection scheme in Table 4.

Table 4.

Ablation study of our tracker on OTB2015 (success score), VOT2018 (EAO score), and VOT2018-LT (F1 score).

Number	Network	Backbone	L1	L2	L3	L4	L5	Att	Ta	Re	OTB2015	VOT2018	LT
i	SiamRPN	AlexNet									0.637	0.326	0.607
ii	SiamRPN								•		0.670	0.360	—
iii	SiamRPN								•	•	—	—	0.621
iv	SiamRPN++	Res50	•								0.601	0.128	—
v	SiamRPN++			•							0.610	0.189	—
vi	SiamRPN++				•						0.669	0.331	—
vii	SiamRPN++					•					0.678	0.374	—
viii	SiamRPN++						•				0.646	0.320	—
ix	SiamRPN++	Res50			•	•	•				0.696	0.414	0.629
x	SiamRPN++				•	•	•		•		0.690	0.421	—
xi	SiamRPN++				•	•	•		•	•	—	—	0.631
xii	Ours	Res50	•	•	•	•	•				0.690	0.420	—
xiii	Ours		Fused by FFM								0.696	0.431	—
xiv	Ours		Fused by FFM					•			0.699	0.461	—
xv	Ours		Fused by FFM					•	•		0.700	0.485	0.615
xvi	Ours		Fused by FFM					•	•	•	—	—	0.641

OTB: object tracking benchmark; VOT: visual object tracking; FFM: feature fusion module; CCM: color-based constraint module; RPN: region proposal network; LT: long-term; L1–L5: Res1–Res5, respectively; Att: the proposed attention module; Ta: the proposed CCM; Re: the proposed Re-detection scheme.

Analysis of FFM

To pursue the influence of our FFM on the tracking result, the features of the first two layers are trained with a single output structure. The comparison among iv–viii shows that by independent operation, the features of Res1 and Res2 perform terribly in both OTB2015 and VOT2018 benchmarks. ix and xii show that the introduction of features from Res1 and Res2 can improve the performance on the VOT2018 benchmark by 0.6% but reduce the performance on OTB2015 by 0.6%, which is due to the inefficient features fusion. The performance gap between independent operation (iv–viii) and collaborative operation (ix, xii) of multilevel features confirms our conclusion that all features of the backbone should be taken into consideration. From ix, xii, and xiii, we find that simply using the average outputs of multilayers is not an appropriate solution. The FFM gains an improvement of 1.1% over the average type baseline, which fully proves the effectiveness of the proposed FFM.

Analysis of attention structure

From Table 4, we can also deduce the advantages of our attention structure. The xiii and xiv present that, through our attention structure, the feature-fused Res50 tracker gets a competitive performance (0.461 on EAO) 4.7% better than SiamRPN++ and 3.0% than the fused baseline. What’s more, to investigate the impact of different attention types, some additional attention evaluations are proposed in Table 5. Evaluations in Table 5 show that by changing the order of channel attention and spatial attention, there is a slight drop of 0.4% on the success score in OTB2015, which is in contrast to an increase of 0.1% on EAO in VOT2018. In the VOT2018 benchmark, channel attention and spatial attention bring an increase of 0.9% and 2.4%, respectively.

Table 5.

Evaluation results on different types of attention.

Attention type	OTB2015 (success)	VOT2018 (EAO)
None	0.696	0.431
Channel only	0.696	0.440
Spatial only	0.696	0.455
Due (SAC)	0.699	0.461
Due (parallel)	0.695	0.462

SAC: spatial attention after channel attention; Parallel: two kinds of attention run parallel like RasNet3; OTB: object tracking benchmark; VOT: visual object tracking; EAO: expected average overlap.

Analysis of color-based constraint module

The xiv and xv in Table 4 prove the advantages of our CCM. Our tracker gains an improvement of 2.4% on the VOT2018 benchmark. However, our CCM does not work on gray sequences and brings little improvement in OTB2015. Furthermore, we infer that the absence of color information should be a generality problem for all Siamese networks. The ii, x, and xv in Table 4 confirm this speculation. Our CCM brings an improvement of 3.4% for DaSiamRPN and 0.7% for SiamRPN++.

Analysis of redetection module

The comparison of xv and xvi in Table 4 shows that the MFW-SiamRPN tracker with our global redetection scheme has an improvement of 2.6% on the VOT2018-LT benchmark. What’s more, any tracker that lacks a global redetection process could benefit from our redetection scheme on LT benchmarks. The comparison of x and xi in Table 4 shows that SiamRPN++ with our global redetection scheme gains a 0.2% improvement. At the same time, ii and iii in Table 4 show that our global redetection scheme brings an improvement of 1.4% for SiamRPN.

Conclusion

In this article, we present a novel MFW-SiamRPN to track the template in a video. Our contributions are motivated by two intuitions. Firstly, too much attention to high-level features leads to ignorance of the capability of low-level features to differentiate between same-class objects. For this reason, the FFM is designed to aggregate features from the lowest level to the highest level. Then, similarity attention is proposed to select more discriminative features and suppress the similarity between distractors and the target. And the CCM is proposed to make full use of the color information lost in the deep network. As a result, our method tracks the target stably even when same-class objects interfere. Secondly, to track the target robustly in an LT video, a global redetection scheme is proposed to precept the target’s condition and redetect the target. Numerous experiments on several data sets demonstrate the effectiveness of our method.

Footnotes

Author contribution

Song Guiling and Zhang jingyi are contributed equally to this work, and the contributions of other authors are in the order of the author sequence.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Song Guiling

References

Bertinetto

Valmadre

Henriques

, et al. Fully-convolutional Siamese networks for object tracking. In: ECCV workshops, Amsterdam, The Netherlands, 11–14 October 2016, pp. 850–865.

Valmadre

Bertinetto

Henriques

, et al. End-to-end representation learning for correlation filter based tracking. In: IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI, USA 21–26 July, 2017, pp. 5000–5008. IEEE.

Wang

Teng

Xing

, et al. Learning attentions: residual attentional Siamese network for high performance online visual tracking. In: IEEE conference on computer vision and pattern recognition (CVPR), Salt Lake City, UT, USA, 18–23 June, 2018, pp. 4854–4863. IEEE.

Fan

Ling

. Parallel tracking and verifying: a framework for real-time and high accuracy visual tracking. In: IEEE international conference on computer vision (ICCV), Venice, Italy, 22–29 October, 2017, pp. 4130–4144. IEEE.

Junjie

, et al. High performance visual tracking with Siamese region proposal network. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), Salt Lake City, UT, USA, 18–23 June, 2018, pp. 8971–8980. IEEE.

Zhu

Wang

, et al. Distractor-aware Siamese networks for visual object tracking. In: European conference on computer vision (ECCV), Munich, Germany, 8–14 September, 2018,pp. 101–117.

Wang

, et al. SiamRPN++: evolution of Siamese visual tracking with very deep networks. In: IEEE/CVF conference on computer vision pattern recognition, Long Beach, CA, USA, 16–20 June 2019, pp. 4282–4291.

Zhang

Peng

Deeper and wider Siamese networks for real-time visual tracking. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), Long Beach, CA, USA, 15 June, 2019, pp. 4586–4595.

Fan

Ling

. Siamese cascaded region proposal networks for real-time visual tracking. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), Long Beach, CA, USA, 15 June 2019, pp. 7944–7953.

10.

Bhat

Danelljan

Gool

, et al. Learning discriminative model prediction for tracking. In: IEEE/CVF international conference on computer vision (ICCV), Seoul, South Korea, 22 April 2019, pp. 6181–6190.

11.

Chen

Ouyang

, et al. Gradnet: gradient-guided network for visual object tracking. In: IEEE/CVF international conference on computer vision (ICCV), Seoul, South Korea, 22 April 2019, pp. 6161–6170.

12.

Choi

Kwon

Lee

. Deep meta learning for real-time target-aware visual tracking. In: IEEE/CVF international conference on computer vision (ICCV), Seoul, South Korea, 22 April 2019, pp. 911–920.

13.

Zhang

Gonzalez-Garcia

Weijer

, et al. Learning the model update for Siamese trackers. In: IEEE/CVF international conference on computer vision (ICCV), Seoul, South Korea, 22 April 2019, pp. 4009–4018.

14.

Yan

Zhao

Wang

, et al. ‘Skimming-perusal’ tracking: a framework for real-time and robust long-term tracking. In: IEEE/CVF international conference on computer vision (ICCV), Seoul, Korea (South), 22 April 2019, pp. 2385–2393.

15.

Wang

Zhang

Bertinetto

, et al. Fast online object tracking and segmentation: a unifying approach. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019, pp. 1328–1338. IEEE.

16.

Gao

Wang

Xing

, et al. Tracking-by-fusion via Gaussian process regression extended to transfer learning. IEEE Trans Pattern Anal Mach Intellig (TPAMI) 2018; 42(4): 939–955.

17.

Chen

. Object tracking framework with Siamese network and re-detection mechanism. EURASIP J Wirel Commun Netw 2019; 261: 1–14.

18.

Kuai

Wen

. When correlation filters meet fully-convolutional Siamese networks for distractor-aware tracking. Signal Process Image Commun 2018; 64: 107–117. S0923596518302121.

19.

Longchao

Peilin . Robust real-time visual object tracking via multi-scale fully convolutional Siamese networks. In: Multimedia tools and applications, Springer, 13 April 2018.

20.

Portaz

Kohl

Chevallet

, et al. Object instance identification with fully convolutional networks. Multimedia Tools Appl 2019, 78(3): 2747–2764.

21.

Zhu

. Real-time object tracking based on improved fully-convolutional Siamese network. Comput Electri Eng 2020; 86(4): 106755.

22.

Chen

Yuan

Peng

, et al. DASNet: dual attentive fully convolutional Siamese networks for change detection of high resolution satellite images. IEEE J Select Topics Appl Earth Observ Remote Sensing 2021; 14: 1194–1206.

23.

Zhang

Yan

. Visual tracking using Siamese convolutional neural network with region proposal and domain specific updating. Neurocomputing 2018; 275: 2645–2655.

24.

Yang

Song

Zhang

. Deeper Siamese network with multi-level feature fusion for real-time visual tracking. Electr Lett 2019, 55(13): 742–745.

25.

Lim

Yang

. Online object tracking: a benchmark. In: IEEE conference on computer vision and pattern recognition (CVPR), Portland, Oregon, USA, 23–28 June, 2013, pp. 2411–2418.

26.

Lim

Yang

. Object tracking Benchmark. IEEE Trans Pattern Anal Mach Intellig 2015; 37(9): 1834–1848.

27.

Kristan

Leonardis

Matas

, et al. The visual object tracking VOT 2017 challenge results. In: IEEE international conference on computer vision (ICCV) Workshops, Venice, Italy, 22–29 October 2017, pp. 1949–1972.

28.

Kristan

Leonardis

Matas

, et al. The sixth visual object tracking VOT 2018 challenge results. In: The European conference on computer vision (ECCV) workshops, Munich, Germany, 8–14 September 2018, pp. 3–53.

29.

Lukezic

Zajc

Voj´ır

, et al. Now you see me: evaluating performance in long-term visual tracking. Volume abs/1804.07056, 2018, pp. 294–301.

30.

Krizhevsky

Sutskever

Hinton

. Imagenet classification with deep convolutional neural networks. In: NIPS, Lake Tahoe, Nevada, United States, 3–6 December 2012, pp. 1097–1105.

31.

Simonyan

Zisserman

. Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations (ICLR), Banff, Canada, 14–16 April 2014, pp. 105–122.

32.

Howard

Zhu

Chen

, et al. MobileNets: efficient convolutional neural networks for mobile vision applications. Volume abs/1704.04861, 2017, pp. 1282–1291.

33.

Held

Thrun

Savarese

Learning to track at 100 fps with deep regression networks. In: European conference on computer vision (ECCV), Amsterdam, The Netherlands, 8–16 October, 2016. pp. 749–765.

34.

Tao

Gavves

Smeulders

. Siamese instance search for tracking. In: IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA, 27–30 June, 2016. pp. 1420–1429.

35.

Ren

Girshick

, et al. Faster R-CNN: towards real-time object detection with region proposal networks. In IEEE Trans Pattern Anal Mach Intellig, 2015; 39(6): 1137–1149.

36.

Dollár

Zitnick

. Fast edge detection using structured forests. IEEE Trans Pattern Anal Mach Intellig, 2014; 37(8): 1558–1570.

37.

Ming

Huang

, et al. Object-level proposals. In: IEEE international conference on computer vision (ICCV), Venice, Italy , 22–29 October, 2017. pp. 4931–4939.

38.

Zhou

Ming

, et al. Objectness-aware tracking via double-layer model. In: IEEE international conference on image (ICIP), Athens, Greece, 7–10 October, 2018. pp. 3713–3717.

39.

Zitnick

Dollar

. Edge boxes: locating object proposals from edges. In: European conference on computer vision (ECCV), Zurich, Switzerland, 6–12 September 2014, pp. 391–405.

40.

Tang

Wang

, et al. Weakly supervised region proposal network and object detection. In: European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018, pp. 352–368.

41.

Russakovsky

Deng

, et al. ImageNet large scale visual recognition challenge. Int J Comput Vision (IJCV) 2015; 115: 211–252.

42.

Woo

Park

Lee

, et al. CBAM: convolutional block attention module. In: European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018, pp. 3–19.

43.

Lin

Maire

Belongie

, et al. Microsoft COCO: common objects in context. In: European conference on computer vision (ECCV), Zurich, Switzerland, 6–12 September 2014, pp. 740–755.

44.

Real

Shlens

Mazzocchi

, et al. Vanhoucke, ‘YouTube-BoundingBoxes: a large high-precision human-annotated data set for object detection in video.’ In: IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI, USA, 21–26 July, 2017. pp. 7464–7473.