Abstract
Visual tracking is a challenging computer vision task due to the significant observation changes of the target. By contrast, the tracking task is relatively easy for humans. In this article, we propose a tracker inspired by the cognitive psychological memory mechanism, which decomposes the tracking task into sensory memory register, short-term memory tracker, and long-term memory tracker like humans. The sensory memory register captures information with three-dimensional perception; the short-term memory tracker builds the highly plastic observation model via memory rehearsal; the long-term memory tracker builds the highly stable observation model via memory encoding and retrieval. With the cooperative models, the tracker can easily handle various tracking scenarios. In addition, an appearance-shape learning method is proposed to update the two-dimensional appearance model and three-dimensional shape model appropriately. Extensive experimental results on a large-scale benchmark data set demonstrate that the proposed method outperforms the state-of-the-art two-dimensional and three-dimensional trackers in terms of efficiency, accuracy, and robustness.
Introduction
Visual object tracking is one of the most fundamental problems in computer vision, with numerous applications such as intelligent surveillance, robot environment perception, and augmented reality. In the visual tracking task, an unknown target object specified in the first frame should be tracked in the subsequent frames. Despite significant progress in the last decades,1–3 tracking remains challenging due to illumination variation, deformation, abrupt motion, and occlusion. By contrast, visual tracking is relatively easy for humans. The key component of a visual tracker is online object modeling; correspondingly, humans perceive the environment with three-dimensional (3-D) stereo vision and remember (model) 3-D objects effectively by using biological memory. In this article, we consider exploiting the biological memory of humans to overcome the visual tracking challenges mentioned above.
The cognitive psychological memory model asserts that the human memory system has three separate components: sensory memory, short-term memory, and long-term memory.4,5 In the sensory memory, environmental information enters the memory system, and external stimuli are detected, held, and sent to the short-term memory. In the short-term memory, the attended information is rehearsed, after which the memory system can generate an immediate and appropriate response to the stimulus. The information inside the short-term memory represents events with high plasticity; however, the short-term memory does not hold information for a long duration. In the long-term memory, repeatedly received short-term information is encoded. The duration and capacity of the long-term memory are assumed to be nearly limitless, which means the remembered information can be maintained over a very long period of time. Meanwhile, information stored in the long-term memory is retrieved into the short-term memory, and information that fails to be retrieved is forgotten. The information inside the long-term memory represents events with high stability.
In visual object tracking, the most challenging problem is the stability–plasticity dilemma:6 the tracker should remain adaptive (plastic) in response to significant input, yet remain stable in response to irrelevant input. Specifically, the tracker needs to adapt to the appearance changes of the target during tracking. With high adaptivity, the tracker is sensitive to target variations but easily corrupted by noisy information from the background, which leads to the model drift problem and tracking failure; with low adaptivity, the tracker is robust to noise but insensitive to new appearances of the target, which leads to the model invalidation problem and tracking failure. Therefore, maintaining a proper level of adaptivity is the key to good tracking performance.
In this article, we propose a cognitive psychological memory model–based tracking (CPMT) algorithm to address the stability–plasticity dilemma mentioned above. Corresponding to the human memory system, CPMT represents the target object with a sensory memory register, a short-term memory tracker (SMT), and a long-term memory tracker (LMT). The sensory memory register perceives 3-D information like humans instead of monocular visual information, so the target can be described more accurately with the additional depth cue. During tracking, the 3-D information is acquired by RGB-D sensors (e.g. Microsoft Kinect), and both a two-dimensional (2-D) appearance model and a 3-D shape model of the target are built simultaneously. The plasticity of the SMT and the stability of the LMT are combined through the encoding and retrieval processes of the cognitive psychological memory model, which achieves both plastic and stable performance against the stability–plasticity dilemma. The SMT uses a kernelized correlation filter to model the target, and the rehearsal process is implemented by frame-to-frame linear model interpolation. The LMT employs a nearest neighbor classifier to model the target. An appearance-shape learning (A-S learning) method is proposed to realize the encoding process, which considers both the 2-D appearance variation and the 3-D shape variation of the target. The retrieval process is realized by long-term memory scoring, and a submodel that fails to be retrieved is regarded as a modeling failure and forgotten.
This study makes three main contributions. First, a CPMT framework is proposed to solve the stability–plasticity dilemma in visual tracking, where the plasticity of the short-term memory and the stability of the long-term memory are integrated by the encoding and retrieval processes. Second, an A-S learning method is proposed, and the target’s 2-D appearance model and 3-D shape model are updated accurately in a complementary manner. Third, extensive experiments are performed on a large tracking benchmark data set 7 with 100 challenging videos, and the efficiency, accuracy, and robustness of the proposed algorithm are demonstrated against state-of-the-art RGB and RGB-D trackers.
The rest of the article is organized as follows: “Related works” section reviews the research related to our work; “The proposed tracker” section introduces the proposed CPMT algorithm for visual tracking; “Experiments” section presents the experimental evaluation results of our CPMT algorithm; and the final section provides the conclusion.
Related works
Visual tracking algorithms can be divided into two main categories: generative trackers and discriminative trackers. In this section, we review both categories, ranging from RGB tracking to RGB-D tracking. RGB tracking is the traditional setting, in which only the color frame stream is acquired during tracking. By contrast, RGB-D tracking is a new research area of visual tracking enabled by the availability of affordable and reliable RGB-D sensors in recent years,8 in which both the color and depth frame streams are provided. Only a limited number of RGB-D trackers have been proposed due to the novelty of this research area. In addition, we review the long-term memory-based trackers, which are closely related to our study.
Generative tracking
Generative trackers regard visual object tracking as a target matching task: the candidate most similar to the target observation model is selected as the target. In RGB tracking, various generative models have been proposed to build the target observation model, such as mean-shift,9 fragment-based models,10 principal component analysis,11 sparse coding,12 and dictionary learning.13 In RGB-D tracking, Meshgi et al.14 proposed an occlusion-aware particle filter framework to deal with complex and persistent occlusions during tracking. In their probabilistic model, each particle is equipped with an occlusion flag variable, and occlusion is detected when the number of particles labeled as occluded is large enough. The algorithm uses multiple features extracted from both the color and depth frames to achieve a robust target representation. Bibi et al.15 presented a 3-D tracker with a part-based sparse coding observation model. The tracker searches for the target by using the particle filter framework with a 3-D observation model and motion model. In addition, this work considers the synchronization and registration noises arising during RGB-D frame capture and proposes automated methods to eliminate them: the color and depth frames are synchronized and registered before the proposed 3-D tracker is run.
Discriminative tracking
Discriminative trackers consider visual object tracking as a binary classification task that distinguishes the target from the background. During tracking, positive and negative examples, denoting the target and the background, respectively, are sampled to train the target classifier. In RGB tracking, many advanced techniques have been applied to discriminative trackers, including multiple-instance learning,16 boosting,17,18 structured output support vector machines (SVMs),19 correlation filters,20–23 multiple experts entropy minimization,24 and deep learning.25–28 In RGB-D tracking, Camplani et al.29 built the target observation model using a kernelized correlation filter combined with color and depth features. The depth cue is additionally employed to estimate the scale change of the target and to detect occlusion when the depth histogram of the target changes suddenly. García et al.30 proposed an RGB-D tracker based on the condensation algorithm. The observation model is represented by a boosting classifier trained from a feature pool extracted from grayscale, color, and depth frames. A 3-D state space is defined to improve the accuracy of the particle filter’s predictions.
Long-term memory-based tracking
Several long-term memory-based tracking algorithms have been proposed in recent years, in which long-term memory is used to avoid model drift and to redetect the target when tracking failures occur. In RGB tracking, Kalal et al.31 decomposed the long-term tracking task into tracking, learning, and detection; two experts were designed in the learning component to estimate missed detections and false alarms. Supancic and Ramanan32 employed self-paced curriculum learning to automatically select the right frames for appearance model updating. Ma et al.33 trained an online detector alongside the correlation tracker to redetect the target when tracking failure occurred. Hong et al.34 proposed a multistore tracker inspired by the Atkinson–Shiffrin memory model; the long-term store provides additional information for output control, realized by keypoint matching tracking and Random Sample Consensus estimation. Wang et al.35 explored and memorized reliable memories from previous frames via a clustering method with temporal constraints, which utilizes uncontaminated information to alleviate drifting issues. In RGB-D tracking, Song and Xiao7 proposed a long-term RGB-D tracker in which an SVM detector is trained using histogram of oriented gradient (HOG) features extracted from color and depth frames. During tracking, the target is detected by the SVM detector and tracked by large displacement optical flow simultaneously; the candidate with the highest classifier score is determined to be the target. Occlusion is estimated by assuming that the target is the closest object in its corresponding bounding box.
The proposed tracker
To address the stability–plasticity dilemma in visual tracking, the proposed CPMT decomposes the tracking task into a sensory memory register, SMT, and LMT, corresponding to the human memory model. The flowchart of CPMT is shown in Figure 1. The sensory memory register captures both color and depth frames from the environment; 3-D information is obtained as in human perception, so both the 2-D appearance model and the 3-D shape model of the target can be built to describe the target more accurately. The SMT has high plasticity due to the rehearsal process, by which the latest observation of the target can be modeled immediately; however, it is sensitive to noisy information and its model is easily corrupted. The LMT, by contrast, has high stability due to the encoding and retrieval processes, and its conservative model updating mechanism makes it insensitive to noise. In CPMT, the two trackers collaborate with each other: in continuous and steady scenarios, the SMT responds quickly and adapts to the target observation immediately; in drastically changing scenarios, the LMT generates a stable response and filters out environmental noise.

Flowchart of the proposed tracking algorithm. The tracking task is decomposed into three components, as in humans: the sensory memory register captures 3-D information from the environment, and the 2-D appearance model and 3-D shape model are built simultaneously during tracking; the short-term memory tracker models the target via the rehearsal process; and the long-term memory tracker models the target via the encoding and retrieval processes. If tracking failure is estimated in short-term tracking, long-term tracking is performed to redetect the target in the environment. Best viewed in color on a high-resolution display. 3-D: three-dimensional; 2-D: two-dimensional.
Short-term memory tracker
The SMT tracks the target rapidly across consecutive frames by using the short-term memory. For efficiency, the short-term memory of SMT is realized by a correlation filter.
Rehearsal with correlation filter
The correlation filter20,21 models the target with a filter $w$ trained on an image patch $x$ by minimizing the ridge regression objective over all cyclic shifts $x_i$ of the patch

$$\min_{w}\ \sum_{i}\left(\langle \varphi(x_i), w\rangle - y_i\right)^{2} + \lambda \lVert w\rVert^{2}$$

where λ is the regularization parameter, $\varphi$ is the feature mapping induced by a kernel, and $y_i$ is the desired Gaussian-shaped label. Using the kernel trick, the filter can be computed in its dual form as

$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda}$$

where “⁁” denotes the FFT operator and $k^{xx}$ is the kernel autocorrelation of the patch $x$. The rehearsal process of SMT is implemented by frame-to-frame linear interpolation of the model

$$\hat{\alpha}_t = (1-\eta)\,\hat{\alpha}_{t-1} + \eta\,\hat{\alpha},\qquad x_t = (1-\eta)\,x_{t-1} + \eta\,x$$

where t is the frame index and η is the learning rate.
Tracking with short-term memory
During short-term tracking, an M × N image patch $z$ is cropped at the target’s previous location, and the response map is computed as

$$f(z) = \mathcal{F}^{-1}\!\left(\hat{k}^{xz} \odot \hat{\alpha}\right)$$

where $\mathcal{F}^{-1}$ denotes the inverse FFT and $k^{xz}$ is the kernel correlation between the learned template $x$ and the new patch $z$. The location of the maximal value of $f(z)$ is taken as the new position of the target.
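The short-term tracking steps above can be sketched as follows, assuming a linear kernel for brevity (the SMT itself uses a kernelized filter; the patch size, Gaussian label width, and learning rate below are illustrative choices):

```python
import numpy as np

def gaussian_label(M, N, sigma=2.0):
    """Desired Gaussian-shaped response y, with its peak wrapped to (0, 0)."""
    m = np.arange(M) - M // 2
    n = np.arange(N) - N // 2
    y = np.exp(-(m[:, None] ** 2 + n[None, :] ** 2) / (2 * sigma ** 2))
    return np.roll(y, (-(M // 2), -(N // 2)), axis=(0, 1))

def train(x, y, lam=1e-4):
    """Dual solution alpha_hat = y_hat / (k_hat_xx + lam) for a linear kernel."""
    x_hat = np.fft.fft2(x)
    k_xx = np.conj(x_hat) * x_hat / x.size  # autocorrelation in the Fourier domain
    return np.fft.fft2(y) / (k_xx + lam)

def detect(alpha_hat, x, z):
    """Response map f(z); its argmax gives the target translation."""
    k_xz = np.conj(np.fft.fft2(x)) * np.fft.fft2(z) / x.size
    response = np.real(np.fft.ifft2(k_xz * alpha_hat))
    return np.unravel_index(int(np.argmax(response)), response.shape)

def rehearse(old, new, eta=0.02):
    """Frame-to-frame linear interpolation (the rehearsal process)."""
    return (1 - eta) * old + eta * new
```

Detecting on a patch that is a circular shift of the training patch moves the response peak by exactly that shift, which is the property the tracker exploits to localize the target.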
Long-term memory tracker
The LMT tracks the target in unsteady scenarios where the SMT may drift and fail. LMT maintains a stable target model in the long-term memory via encoding and retrieval, which are the two key processes of the cognitive psychological memory model. Specifically, an A-S learning method is proposed to realize the processes of encoding and retrieval.
Encoding and retrieval by A-S learning
The long-term memory in LMT is represented by the observations of the target and the background collected so far. It consists of a 2-D appearance model and a 3-D shape model.
The encoding and retrieval processes are of prime importance in the human memory system. Since the short-term memory is not retained for long and is easily overwritten by newly arriving information, the long-term memory of LMT uses the encoding process to remember repeatedly received target observations. Meanwhile, observation noises may exist in the long-term memory as well, which would decrease the performance of LMT. During tracking, the results of SMT are retrieved in the long-term memory frame by frame: the memory of successfully retrieved target observations is enhanced, and the memory of observations that fail to be retrieved is forgotten. This forgetting mechanism enables the LMT to eliminate the noises in its long-term memory.
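The enhancement and forgetting mechanism can be illustrated with a toy nearest-neighbor template store (the cosine similarity, retrieval threshold, and age-based forgetting below are our own illustrative choices, not the paper's exact design):

```python
import numpy as np

class LongTermMemoryStore:
    """Toy template memory: nearest-neighbor retrieval, enhancement of
    matched templates, and forgetting of templates that go unmatched."""

    def __init__(self, tau=0.8, forget_after=10):
        self.templates = []   # stored observation features
        self.age = []         # frames since each template was last retrieved
        self.tau = tau
        self.forget_after = forget_after

    def retrieve(self, feat):
        """Return the best cosine similarity and the index of the match."""
        if not self.templates:
            return 0.0, -1
        feat = np.asarray(feat, dtype=float)
        sims = [float(np.dot(t, feat) / (np.linalg.norm(t) * np.linalg.norm(feat)))
                for t in self.templates]
        best = int(np.argmax(sims))
        return sims[best], best

    def update(self, feat):
        sim, idx = self.retrieve(feat)
        if sim >= self.tau:
            self.age[idx] = 0                                     # enhance: reset age
        else:
            self.templates.append(np.asarray(feat, dtype=float))  # encode new observation
            self.age.append(0)
        self.age = [a + 1 for a in self.age]                      # every template ages
        keep = [i for i, a in enumerate(self.age) if a <= self.forget_after]
        self.templates = [self.templates[i] for i in keep]        # forget stale templates
        self.age = [self.age[i] for i in keep]
```

A repeatedly re-observed template keeps having its age reset and survives, while one that is never retrieved again ages out and is forgotten.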
LMT employs a novel A-S learning method to encode and retrieve the long-term memory. A-S learning is based on the hypothesis that the variation of the target’s observation should not be drastic in 2-D appearance and 3-D shape simultaneously. For instance, when an illumination change occurs in the scenario, the 2-D appearance may vary drastically due to the projection effect, but the 3-D shape may stay invariant since the point cloud capture does not depend on illumination; by contrast, when the target rotates in front of the camera, its 3-D shape may vary due to the change of view, while its 2-D appearance may vary only slightly since the color and texture on the surface are invariant. In the encoding process, the enabling of the appearance learning La and the shape learning Ls is set as follows
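A minimal sketch consistent with this hypothesis (the thresholds and the exact decision rule here are our illustrative assumptions, not the paper's formula) might read:

```python
def as_learning_gate(delta_a, delta_s, tau_a=0.5, tau_s=0.5):
    """Decide whether to enable appearance learning (La) and shape learning (Ls).

    delta_a, delta_s: measured 2-D appearance and 3-D shape variations.
    tau_a, tau_s: drastic-change thresholds (illustrative values).
    """
    if delta_a > tau_a and delta_s > tau_s:
        # Both cues changed drastically at once: by the A-S hypothesis
        # this indicates noise, so neither model is encoded.
        return False, False
    # Otherwise each cue is learned only while it remains consistent.
    return delta_a <= tau_a, delta_s <= tau_s
```

Under an illumination change the shape cue stays consistent and keeps learning; under a rotation the appearance cue does, matching the two examples above.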
In the retrieval process, the positive nearest neighbors
Tracking with long-term memory
During tracking, the result generated by SMT is evaluated against the long-term memory of LMT. The evaluation is based on appearance-shape validation (A-S validation): If
Experiments
To demonstrate the efficiency of the proposed algorithm, we evaluate it on a large RGB-D tracking benchmark data set. 7 First, we test and analyze the proposed CPMT framework. Next, we compare the proposed CPMT tracker with state-of-the-art RGB and RGB-D trackers.
Experimental setups
Our algorithm is implemented in native MATLAB without optimization. The experiments are performed on an Intel I5-2400 3.10 GHz CPU with 4 GB RAM.
Implementation details
In SMT, the size of attention area is set to
Evaluation data set
The Princeton Tracking Benchmark (PTB)7 is used to evaluate our algorithm. The PTB data set contains 100 RGB-D videos and allows the evaluation of both 2-D and 3-D visual trackers. The videos in the data set are annotated with 11 attributes according to target type (human, animal, and rigid), target size (large and small), movement (slow and fast), occlusion (yes and no), and motion type (passive and active), indicating different challenges in the visual tracking task. To ensure fair evaluation and comparison between trackers, the ground truths of the data set are withheld to prevent data-specific parameter tuning. To evaluate a tracker, the tracking results of all videos in the data set are packaged and submitted to the benchmark website (http://tracking.cs.princeton.edu), and the evaluation and comparison results are then generated automatically online.
Evaluation methodology
We employ the evaluation method in PTB7 to quantitatively evaluate the performance of the proposed algorithm, where the average success rate metric is used. The metric is defined as the area under the curve of the tracker’s success plot, which is generated by varying the overlap threshold for judging a successful frame from 0 to 1 and recording the percentage of successful frames at each threshold. The overlap between the tracking result and the ground truth is defined as

$$O = \frac{\operatorname{area}(ROI_T \cap ROI_G)}{\operatorname{area}(ROI_T \cup ROI_G)}$$

where $ROI_T$ is the bounding box output by the tracker and $ROI_G$ is the ground truth bounding box.
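Both quantities are straightforward to compute; a small sketch with boxes given as (x, y, w, h) follows (the 101-threshold sampling is an illustrative discretization of the area under the success plot):

```python
import numpy as np

def overlap(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def average_success_rate(overlaps, n_thresholds=101):
    """Area under the success plot: for each threshold in [0, 1], record the
    fraction of frames whose overlap exceeds it, then average."""
    ov = np.asarray(overlaps, dtype=float)
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    return float(np.mean([(ov > t).mean() for t in thresholds]))
```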
Algorithm analysis
In the proposed CPMT tracker, the SMT and LMT collaborate with each other through the encoding and retrieval processes of the cognitive psychological memory model, which provides high plasticity and stability simultaneously. To demonstrate the effectiveness of this collaboration, we additionally test the short-term component SMT alone on the PTB data set and compare its performance with that of the full CPMT.
As shown in Figure 2, CPMT performs better than SMT, with a 7.6% average success rate gain. In addition, CPMT outperforms SMT in all 11 tracking attributes. The improvement is especially significant for occlusion (14.3%), passive (12.9%), and human (9.6%). In the occlusion scenario, the observation of the target is covered by the occluder; in the passive scenario, the motion of the target is irregular and abrupt motion may happen; in the human scenario, drastic observation variations may occur due to the deformability of the target. In these scenarios, the continuity of the tracking task is broken and noises are brought into the observation model. The SMT models the target’s observation with a highly adaptive short-term memory, which works well only in continuous conditions and is sensitive to noise. By contrast, the CPMT additionally models the target with the long-term memory, which is stable in scenarios with drastic variations and robust to noise. The collaboration of short-term memory and long-term memory achieves both plasticity and stability for the CPMT. In addition, the A-S learning method in the long-term memory builds the 2-D appearance model and the 3-D shape model of the target in a complementary manner, which makes the model update robust to variations in the scenario.

Comparison of the tracker employing only the SMT with the tracker employing the collaborative short-term and long-term memory (CPMT). CPMT outperforms SMT in all tracking scenarios. Best viewed in color. SMT: short-term memory tracker; CPMT: cognitive psychological memory model-based tracking.
State-of-the-art comparison
We compare the proposed CPMT tracker with both state-of-the-art RGB trackers and state-of-the-art RGB-D trackers. The RGB trackers include MEEM,24 KCF,21 CN2,22 and Struck;19 the RGB-D trackers include OAPF,14 PST,15 PrinT,7 and DS-KCF.29 The comparison results are shown in Table 1 and Figure 3.
Experimental results of state-of-the-art comparison on the PTB.a
PTB: Princeton Tracking Benchmark; CPMT: cognitive psychological memory model-based tracking; SR: successful rate; HOG: histogram of oriented gradient.
aAverage SRs and rankings (in parentheses) are presented under different attributes. The best and the second best results are in red and blue, respectively.
bThese trackers take advantage of depth (3-D) information. *These trackers are proposed by PrinT: 7 (a) RGBD HOG detection + optical flow + occlusion handling; (b) RGBD HOG detection + optical flow; (c) point cloud detection + optical flow; (d) RGBD HOG detection; (e) point cloud detection; (f) depth HOG detection; (g) RGB HOG detection; (h) point cloud optical flow; (i) optical flow.
#These trackers are benchmarked on the data set but unpublished.

Qualitative evaluation of the top five trackers on PTB: the proposed CPMT, OAPF, 14 PST, 15 PrinT–(a)RGBDOcc+OF, 7 and DS-KCF. 29 Videos from top to down and left to right are basketball2, bdog_occ2, flower_red_occ, libary2.1_occ, new_ex_occ2, new_student_center3, toy_green_occ, two_people_1.3, wuguiTwo_no, and zball_no2, respectively. Our algorithm performs consistently against state-of-the-art trackers. Best viewed in color with high-resolution display. PTB: Princeton Tracking Benchmark; CPMT: cognitive psychological memory model-based tracking.
The average rank and overall SR columns in Table 1 show that the proposed CPMT tracker outperforms all RGB and RGB-D trackers, with an 8.9% average success rate improvement over the second best tracker. In the attribute-based comparison, CPMT performs either best or second best in all 11 attributes. Specifically, CPMT performs best in 8 of the 11 attributes, namely human, rigid, large, small, fast, occ, no-occ, and active, and second best in the other three, namely animal, slow, and passive. In particular, CPMT outperforms the second best tracker by a large margin in active (7.3%), fast (6.7%), and large (5.3%). Additionally, Table 1 shows that the average performance of the RGB-D trackers is much better than that of the RGB trackers due to the use of 3-D information. These improvements demonstrate that the CPMT tracker performs robustly over a large range of scenarios. This is because CPMT models the target’s observation with two complementary models, as humans do: the short-term memory model achieves high plasticity and the long-term memory model achieves high stability. The collaboration of the two models addresses the stability–plasticity dilemma in visual tracking, which enables CPMT to adapt to tracking challenges with both fast and drastic variations.
The average frame rate of the proposed CPMT algorithm is 5.4 frames per second (fps). Its speed is much higher than that of the second best tracker OAPF (0.9 fps; CPMT is 6× faster), the third best tracker PST (offline, with no online frame rate reported), and the fourth best tracker RGBDOcc+OF (0.1 fps; CPMT is 54× faster). This efficiency arises because the encoding and retrieval processes, which transfer information between the short-term and long-term memory, are intuitive and efficient thanks to the proposed A-S learning method.
Conclusion
In this article, we proposed a novel visual tracking algorithm (CPMT) inspired by the cognitive psychological memory mechanism. CPMT decomposes the tracking task into three components, as in humans: the sensory memory register captures 3-D information from the environment, and the 2-D appearance model and 3-D shape model are built simultaneously during tracking; the SMT models the target via the rehearsal process with high plasticity; and the LMT models the target via the encoding and retrieval processes with high stability. Extensive experimental results on a large-scale RGB-D benchmark demonstrate that the components of the biologically inspired framework collaborate with each other, and the proposed CPMT performs favorably against state-of-the-art trackers in terms of efficiency, accuracy, and robustness.
Footnotes
Authors note
Ning An and Shi-Ying Sun are also affiliated to University of Chinese Academy of Sciences, Beijing, China.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the National Natural Science Foundation of China under Grant (61271432, 61673378, 61421004).
