Abstract
Visual tracking is a challenging computer vision task due to the significant observation changes of the target. By contrast, the tracking task is relatively easy for humans. In this article, we propose a tracker inspired by the cognitive psychological memory mechanism, which decomposes the tracking task into sensory memory register, short-term memory tracker, and long-term memory tracker like humans. The sensory memory register captures information with three-dimensional perception; the short-term memory tracker builds the highly plastic observation model via memory rehearsal; the long-term memory tracker builds the highly stable observation model via memory encoding and retrieval. With the cooperative models, the tracker can easily handle various tracking scenarios. In addition, an appearance-shape learning method is proposed to update the two-dimensional appearance model and three-dimensional shape model appropriately. Extensive experimental results on a large-scale benchmark data set demonstrate that the proposed method outperforms the state-of-the-art two-dimensional and three-dimensional trackers in terms of efficiency, accuracy, and robustness.
Introduction
Visual object tracking is one of the most fundamental problems in computer vision, with numerous applications such as intelligent surveillance, robot environment perception, and augmented reality. In the visual tracking task, an unknown target object specified in the first frame should be tracked in the subsequent frames. Despite significant progress in the last decades,1–3 tracking remains challenging due to illumination variation, deformation, abrupt motion, and occlusion. By contrast, visual tracking is relatively easy for humans. The key component of a visual tracker is online object modeling; correspondingly, humans perceive the environment with three-dimensional (3-D) stereo vision and remember (model) 3-D objects effectively by using biological memory. In this article, we consider exploiting the biological memory of humans to overcome the visual tracking challenges mentioned above.
The cognitive psychological memory model asserts that the human memory system has three separate components: sensory memory, short-term memory, and long-term memory.4,5 In the sensory memory, environmental information enters the memory system, and external stimuli are detected, held, and sent to the short-term memory. In the short-term memory, the attended information is rehearsed, after which the memory system can generate an immediate and appropriate response to the stimulus. The information inside the short-term memory represents events with high plasticity; however, the short-term memory does not hold information for a long duration. In the long-term memory, repeatedly received short-term information is encoded. The duration and capacity of the long-term memory are assumed to be nearly limitless, which means the remembered information can be maintained over a very long period of time. Meanwhile, information stored in the long-term memory is retrieved into the short-term memory, and information that fails to be retrieved is forgotten. The information inside the long-term memory represents events with high stability.
In visual object tracking, the most challenging problem is the stability–plasticity dilemma:6 the tracker should remain adaptive (plastic) in response to significant input, yet remain stable in response to irrelevant input. Specifically, the tracker needs to adapt to the appearance changes of the target during tracking. With high adaptivity, the tracker is sensitive to target variations but easily corrupted by noisy information from the background, which leads to the model drift problem and tracking failure; with low adaptivity, the tracker is robust to noise but insensitive to new appearances of the target, which leads to the model invalidation problem and tracking failure. Therefore, maintaining a proper level of adaptivity is the key to good tracking performance.
In this article, we propose a cognitive psychological memory model–based tracking (CPMT) algorithm to address the stability–plasticity dilemma mentioned above. Corresponding to the human memory system, CPMT represents the target object with a sensory memory register, a short-term memory tracker (SMT), and a long-term memory tracker (LMT). The sensory memory register perceives 3-D information like humans instead of monocular visual information, so the target can be described more accurately with the additional depth cue. During tracking, the 3-D information is acquired by RGB-D sensors (e.g. Microsoft Kinect), and both a two-dimensional (2-D) appearance model and a 3-D shape model of the target are built simultaneously. The plasticity of the SMT and the stability of the LMT are combined through the encoding and retrieval processes of the cognitive psychological memory model, which achieves both plastic and stable performance against the stability–plasticity dilemma. The SMT uses a kernelized correlation filter to model the target, and the rehearsal process is implemented by frame-to-frame linear model interpolation. The LMT employs a nearest neighbor classifier to model the target. An appearance-shape learning (A-S learning) method is proposed to realize the encoding process, which considers both the 2-D appearance variation and the 3-D shape variation of the target. The retrieval process is realized by long-term memory scoring, and a submodel that fails to be retrieved is regarded as a modeling failure and forgotten.
This study makes three main contributions. First, a CPMT framework is proposed to solve the stability–plasticity dilemma in visual tracking, where the plasticity of the short-term memory and the stability of the long-term memory are integrated by the encoding and retrieval processes. Second, an A-S learning method is proposed, and the target’s 2-D appearance model and 3-D shape model are updated accurately in a complementary manner. Third, extensive experiments are performed on a large tracking benchmark data set 7 with 100 challenging videos, and the efficiency, accuracy, and robustness of the proposed algorithm are demonstrated against state-of-the-art RGB and RGB-D trackers.
The rest of the article is organized as follows: “Related works” section reviews the research related to our work; “The proposed tracker” section introduces the proposed CPMT algorithm for visual tracking; “Experiments” section presents the experimental evaluation results of our CPMT algorithm; and the final section provides the conclusion.
Related works
Visual tracking algorithms can be divided into two main categories: generative trackers and discriminative trackers. In this section, we review both categories, ranging from RGB tracking to RGB-D tracking. RGB tracking is the traditional setting, in which only the color frame stream is acquired during tracking. By contrast, RGB-D tracking is a new research area of visual tracking enabled by the availability of affordable and reliable RGB-D sensors in recent years,8 in which both the color and depth frame streams are provided. Only a limited number of RGB-D trackers have been proposed due to the novelty of this research area. In addition, we review the long-term memory-based trackers, which are closely related to our study.
Generative tracking
Generative trackers regard visual object tracking as a target matching task: the candidate most similar to the target observation model is selected as the target. In RGB tracking, various generative models have been proposed to build the target observation model, such as mean-shift,9 fragment-based models,10 principal component analysis,11 sparse coding,12 and dictionary learning.13 In RGB-D tracking, Meshgi et al.14 proposed an occlusion-aware particle filter framework to deal with complex and persistent occlusions during tracking. In their probabilistic model, each particle is equipped with an occlusion flag variable, and occlusion is detected when the number of particles labeled as occluded is large enough. The algorithm uses multiple features extracted from both the color and depth frames to achieve a robust target representation. Bibi et al.15 presented a 3-D tracker with a part-based sparse coding observation model. The tracker searches for the target by using the particle filter framework with a 3-D observation model and motion model. In addition, this work considers the synchronization and registration noises arising during RGB-D frame capture and proposes automated methods to eliminate them: the color and depth frames are synchronized and registered before the proposed 3-D tracker is run.
Discriminative tracking
Discriminative trackers consider visual object tracking as a binary classification task that distinguishes the target from the background. During tracking, positive and negative examples, denoting the target and the background, respectively, are sampled to train the target classifier. In RGB tracking, many advanced techniques have been applied to discriminative trackers, including multiple-instance learning,16 boosting,17,18 structured output support vector machines (SVMs),19 correlation filters,20–23 multiple experts entropy minimization,24 and deep learning.25–28 In RGB-D tracking, Camplani et al.29 built the target observation model using a kernelized correlation filter combined with color and depth features. The depth cue is additionally employed to estimate the scale change of the target and to detect occlusion when the depth histogram of the target changes suddenly. García et al.30 proposed an RGB-D tracker based on the condensation algorithm. The observation model is represented by a boosting classifier trained from a feature pool extracted from grayscale, color, and depth frames. A 3-D state space is defined to improve the accuracy of the particle filter’s predictions.
Long-term memory-based tracking
Several long-term memory-based tracking algorithms have been proposed in recent years, in which long-term memory is used to avoid model drift and to redetect the target when tracking failures occur. In RGB tracking, Kalal et al.31 decomposed the long-term tracking task into tracking, learning, and detection; two experts were designed in the learning component to estimate missed detections and false alarms. Supancic and Ramanan32 employed self-paced curriculum learning to automatically select the right frames for appearance model updating. Ma et al.33 trained an online detector alongside the correlation tracker to redetect the target when tracking failure occurred. Hong et al.34 proposed a multistore tracker inspired by the Atkinson–Shiffrin memory model; the long-term store provides additional information for output control, realized by keypoint matching tracking and Random Sample Consensus estimation. Wang et al.35 explored and memorized reliable memories from previous frames via a clustering method with temporal constraints, which utilizes uncontaminated information to alleviate drifting issues. In RGB-D tracking, Song and Xiao7 proposed a long-term RGB-D tracker in which an SVM detector is trained using histogram of oriented gradient (HOG) features extracted from color and depth frames. During tracking, the target is detected by the SVM detector and tracked by large displacement optical flow simultaneously; the candidate with the highest classifier score is determined to be the target. Occlusion is estimated by assuming that the target is the closest object in its corresponding bounding box.
The proposed tracker
To address the stability–plasticity dilemma in visual tracking, the proposed CPMT decomposes the tracking task into a sensory memory register, SMT, and LMT, corresponding to the human memory model. The flowchart of CPMT is shown in Figure 1. The sensory memory register captures both color and depth frames from the environment; 3-D information is obtained as in human perception, so both the 2-D appearance model and the 3-D shape model of the target can be built to describe the target more accurately. The SMT has high plasticity due to the rehearsal process, by which the latest observation of the target can be modeled immediately; however, it is sensitive to noisy information and its model is easily corrupted. The LMT, by contrast, has high stability due to the encoding and retrieval processes, and its conservative model updating mechanism makes it insensitive to noise. In CPMT, the two trackers collaborate with each other: in continuous and steady scenarios, the SMT responds quickly and adapts to the target observation immediately; in drastically changing scenarios, the LMT generates a stable response and filters out environmental noise.

Flowchart of the proposed tracking algorithm. The tracking task is decomposed into three components, as in humans: the sensory memory register captures 3-D information from the environment, and the 2-D appearance model and 3-D shape model are built simultaneously during tracking; the short-term memory tracker models the target via the rehearsal process; and the long-term memory tracker models the target via the encoding and retrieval processes. If tracking failure is estimated in short-term tracking, long-term tracking is performed to redetect the target in the environment. Best viewed in color on a high-resolution display. 3-D: three-dimensional; 2-D: two-dimensional.
Short-term memory tracker
The SMT tracks the target rapidly across consecutive frames by using the short-term memory. For efficiency, the short-term memory of SMT is realized by a correlation filter.
Rehearsal with correlation filter
The correlation filter20,21 models the target with a filter $w$ trained on an image patch $x$ by minimizing the ridge regression objective over all cyclic shifts $x_i$ of the patch

$$\min_{w}\ \sum_{i}\left(\langle \varphi(x_i), w\rangle - y_i\right)^{2} + \lambda \lVert w\rVert^{2}$$

where λ is the regularization parameter, $\varphi$ is the feature mapping induced by a kernel, and $y_i$ is the desired Gaussian-shaped label. Using the kernel trick, the filter can be computed in its dual form as

$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda}$$

where “⁁” denotes the FFT operator and $k^{xx}$ is the kernel autocorrelation of the patch $x$. The rehearsal process of SMT is implemented by frame-to-frame linear interpolation of the model

$$\hat{\alpha}_t = (1-\eta)\,\hat{\alpha}_{t-1} + \eta\,\hat{\alpha},\qquad x_t = (1-\eta)\,x_{t-1} + \eta\,x$$

where t is the frame index and η is the learning rate.
Tracking with short-term memory
During short-term tracking, an M × N image patch $z$ is cropped at the target’s previous location, and the response map is computed as

$$f(z) = \mathcal{F}^{-1}\!\left(\hat{k}^{xz} \odot \hat{\alpha}\right)$$

where $\mathcal{F}^{-1}$ denotes the inverse FFT and $k^{xz}$ is the kernel correlation between the learned template $x$ and the new patch $z$. The location of the maximal value of $f(z)$ is taken as the new position of the target.
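The short-term tracking steps above can be sketched as follows, assuming a linear kernel for brevity (the SMT itself uses a kernelized filter; the patch size, Gaussian label width, and learning rate below are illustrative choices):

```python
import numpy as np

def gaussian_label(M, N, sigma=2.0):
    """Desired Gaussian-shaped response y, with its peak wrapped to (0, 0)."""
    m = np.arange(M) - M // 2
    n = np.arange(N) - N // 2
    y = np.exp(-(m[:, None] ** 2 + n[None, :] ** 2) / (2 * sigma ** 2))
    return np.roll(y, (-(M // 2), -(N // 2)), axis=(0, 1))

def train(x, y, lam=1e-4):
    """Dual solution alpha_hat = y_hat / (k_hat_xx + lam) for a linear kernel."""
    x_hat = np.fft.fft2(x)
    k_xx = np.conj(x_hat) * x_hat / x.size  # autocorrelation in the Fourier domain
    return np.fft.fft2(y) / (k_xx + lam)

def detect(alpha_hat, x, z):
    """Response map f(z); its argmax gives the target translation."""
    k_xz = np.conj(np.fft.fft2(x)) * np.fft.fft2(z) / x.size
    response = np.real(np.fft.ifft2(k_xz * alpha_hat))
    return np.unravel_index(int(np.argmax(response)), response.shape)

def rehearse(old, new, eta=0.02):
    """Frame-to-frame linear interpolation (the rehearsal process)."""
    return (1 - eta) * old + eta * new
```

Detecting on a patch that is a circular shift of the training patch moves the response peak by exactly that shift, which is the property the tracker exploits to localize the target.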
Long-term memory tracker
The LMT tracks the target in unsteady scenarios where the SMT may drift and fail. LMT maintains a stable target model in the long-term memory via encoding and retrieval, which are the two key processes of the cognitive psychological memory model. Specifically, an A-S learning method is proposed to realize the processes of encoding and retrieval.
Encoding and retrieval by A-S learning
The long-term memory in LMT is represented by the observations of the target and the background collected so far. It consists of a 2-D appearance model and a 3-D shape model.
The encoding and retrieval processes are of prime importance in the human memory system. Since the short-term memory is not retained for long and is easily overwritten by newly arriving information, the long-term memory of LMT uses the encoding process to remember repeatedly received target observations. Meanwhile, observation noises may exist in the long-term memory as well, which would decrease the performance of LMT. During tracking, the results of SMT are retrieved in the long-term memory frame by frame: the memory of successfully retrieved target observations is enhanced, and the memory of observations that fail to be retrieved is forgotten. This forgetting mechanism enables the LMT to eliminate the noises in its long-term memory.
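The enhancement and forgetting mechanism can be illustrated with a toy nearest-neighbor template store (the cosine similarity, retrieval threshold, and age-based forgetting below are our own illustrative choices, not the paper's exact design):

```python
import numpy as np

class LongTermMemoryStore:
    """Toy template memory: nearest-neighbor retrieval, enhancement of
    matched templates, and forgetting of templates that go unmatched."""

    def __init__(self, tau=0.8, forget_after=10):
        self.templates = []   # stored observation features
        self.age = []         # frames since each template was last retrieved
        self.tau = tau
        self.forget_after = forget_after

    def retrieve(self, feat):
        """Return the best cosine similarity and the index of the match."""
        if not self.templates:
            return 0.0, -1
        feat = np.asarray(feat, dtype=float)
        sims = [float(np.dot(t, feat) / (np.linalg.norm(t) * np.linalg.norm(feat)))
                for t in self.templates]
        best = int(np.argmax(sims))
        return sims[best], best

    def update(self, feat):
        sim, idx = self.retrieve(feat)
        if sim >= self.tau:
            self.age[idx] = 0                                     # enhance: reset age
        else:
            self.templates.append(np.asarray(feat, dtype=float))  # encode new observation
            self.age.append(0)
        self.age = [a + 1 for a in self.age]                      # every template ages
        keep = [i for i, a in enumerate(self.age) if a <= self.forget_after]
        self.templates = [self.templates[i] for i in keep]        # forget stale templates
        self.age = [self.age[i] for i in keep]
```

A repeatedly re-observed template keeps having its age reset and survives, while one that is never retrieved again ages out and is forgotten.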
LMT employs a novel A-S learning method to encode and retrieve the long-term memory. A-S learning is based on the hypothesis that the variation of the target’s observation should not be drastic in 2-D appearance and 3-D shape simultaneously. For instance, when an illumination change occurs in the scenario, the 2-D appearance may vary drastically due to the projection effect, but the 3-D shape may stay invariant since the point cloud capture does not depend on illumination; by contrast, when the target rotates in front of the camera, its 3-D shape may vary due to the change of view, while its 2-D appearance may vary only slightly since the color and texture on the surface are invariant. In the encoding process, the enabling of the appearance learning La and the shape learning Ls is set as follows
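A minimal sketch consistent with this hypothesis (the thresholds and the exact decision rule here are our illustrative assumptions, not the paper's formula) might read:

```python
def as_learning_gate(delta_a, delta_s, tau_a=0.5, tau_s=0.5):
    """Decide whether to enable appearance learning (La) and shape learning (Ls).

    delta_a, delta_s: measured 2-D appearance and 3-D shape variations.
    tau_a, tau_s: drastic-change thresholds (illustrative values).
    """
    if delta_a > tau_a and delta_s > tau_s:
        # Both cues changed drastically at once: by the A-S hypothesis
        # this indicates noise, so neither model is encoded.
        return False, False
    # Otherwise each cue is learned only while it remains consistent.
    return delta_a <= tau_a, delta_s <= tau_s
```

Under an illumination change the shape cue stays consistent and keeps learning; under a rotation the appearance cue does, matching the two examples above.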
In the retrieval process, the positive nearest neighbors
Tracking with long-term memory
During tracking, the result generated by SMT is evaluated against the long-term memory of LMT. The evaluation is based on appearance-shape validation (A-S validation): If
Experiments
To demonstrate the efficiency of the proposed algorithm, we evaluate it on a large RGB-D tracking benchmark data set. 7 First, we test and analyze the proposed CPMT framework. Next, we compare the proposed CPMT tracker with state-of-the-art RGB and RGB-D trackers.
Experimental setups
Our algorithm is implemented in native MATLAB without optimization. The experiments are performed on an Intel I5-2400 3.10 GHz CPU with 4 GB RAM.
Implementation details
In SMT, the size of attention area is set to
Evaluation data set
The Princeton Tracking Benchmark (PTB)7 is used to evaluate our algorithm. The PTB data set contains 100 RGB-D videos and allows the evaluation of both 2-D and 3-D visual trackers. The videos in the data set are annotated with 11 attributes according to target type (human, animal, and rigid), target size (large and small), movement (slow and fast), occlusion (yes and no), and motion type (passive and active), indicating different challenges in the visual tracking task. To ensure fair evaluation and comparison between trackers, the ground truths of the data set are withheld to prevent data-specific parameter tuning. To evaluate a tracker, the tracking results of all videos in the data set are packaged and submitted to the benchmark website (http://tracking.cs.princeton.edu), and the evaluation and comparison results are then generated automatically online.
Evaluation methodology
We employ the evaluation method in PTB7 to quantitatively evaluate the performance of the proposed algorithm, where the average success rate metric is used. The metric is defined as the area under the curve of the tracker’s success plot, which is generated by varying the overlap threshold for judging a successful frame from 0 to 1 and recording the percentage of successful frames at each threshold. The overlap between the tracking result and the ground truth is defined as

$$O = \frac{\operatorname{area}(ROI_T \cap ROI_G)}{\operatorname{area}(ROI_T \cup ROI_G)}$$

where $ROI_T$ is the bounding box output by the tracker and $ROI_G$ is the ground truth bounding box.
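Both quantities are straightforward to compute; a small sketch with boxes given as (x, y, w, h) follows (the 101-threshold sampling is an illustrative discretization of the area under the success plot):

```python
import numpy as np

def overlap(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def average_success_rate(overlaps, n_thresholds=101):
    """Area under the success plot: for each threshold in [0, 1], record the
    fraction of frames whose overlap exceeds it, then average."""
    ov = np.asarray(overlaps, dtype=float)
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    return float(np.mean([(ov > t).mean() for t in thresholds]))
```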
Algorithm analysis
In the proposed CPMT tracker, the SMT and LMT collaborate with each other through the encoding and retrieval processes of the cognitive psychological memory model, which provides high plasticity and stability simultaneously. To demonstrate the effectiveness of this collaboration, we additionally test the short-term component SMT alone on the PTB data set and compare its performance with that of the full CPMT.
As shown in Figure 2, CPMT performs better than SMT, with a 7.6% average success rate gain. In addition, CPMT outperforms SMT in all 11 tracking attributes. The improvement is especially significant for occlusion (14.3%), passive (12.9%), and human (9.6%). In the occlusion scenario, the observation of the target is covered by the occluder; in the passive scenario, the motion of the target is irregular and abrupt motion may happen; in the human scenario, drastic observation variations may occur due to the deformability of the target. In these scenarios, the continuity of the tracking task is broken and noises are brought into the observation model. The SMT models the target’s observation with a highly adaptive short-term memory, which works well only in continuous conditions and is sensitive to noise. By contrast, the CPMT additionally models the target with the long-term memory, which is stable in scenarios with drastic variations and robust to noise. The collaboration of short-term memory and long-term memory achieves both plasticity and stability for the CPMT. In addition, the A-S learning method in the long-term memory builds the 2-D appearance model and the 3-D shape model of the target in a complementary manner, which makes the model update robust to variations in the scenario.

Comparison of the tracker employing only the SMT with the tracker employing the collaborative short-term and long-term memory (CPMT). CPMT outperforms SMT in all tracking scenarios. Best viewed in color. SMT: short-term memory tracker; CPMT: cognitive psychological memory model-based tracking.
State-of-the-art comparison
We compare the proposed CPMT tracker with both state-of-the-art RGB trackers and state-of-the-art RGB-D trackers. The RGB trackers include MEEM,24 KCF,21 CN2,22 and Struck;19 the RGB-D trackers include OAPF,14 PST,15 PrinT,7 and DS-KCF.29 The comparison results are shown in Table 1 and Figure 3.
Experimental results of state-of-the-art comparison on the PTB.a
PTB: Princeton Tracking Benchmark; CPMT: cognitive psychological memory model-based tracking; SR: successful rate; HOG: histogram of oriented gradient.
aAverage SRs and rankings (in parentheses) are presented under different attributes. The best and the second best results are in red and blue, respectively.
bThese trackers take advantage of depth (3-D) information. *These trackers are proposed by PrinT: 7 (a) RGBD HOG detection + optical flow + occlusion handling; (b) RGBD HOG detection + optical flow; (c) point cloud detection + optical flow; (d) RGBD HOG detection; (e) point cloud detection; (f) depth HOG detection; (g) RGB HOG detection; (h) point cloud optical flow; (i) optical flow.
#These trackers are benchmarked on the data set but unpublished.

Qualitative evaluation of the top five trackers on PTB: the proposed CPMT, OAPF, 14 PST, 15 PrinT–(a)RGBDOcc+OF, 7 and DS-KCF. 29 Videos from top to down and left to right are basketball2, bdog_occ2, flower_red_occ, libary2.1_occ, new_ex_occ2, new_student_center3, toy_green_occ, two_people_1.3, wuguiTwo_no, and zball_no2, respectively. Our algorithm performs consistently against state-of-the-art trackers. Best viewed in color with high-resolution display. PTB: Princeton Tracking Benchmark; CPMT: cognitive psychological memory model-based tracking.
The average rank and overall SR columns in Table 1 show that the proposed CPMT tracker outperforms all RGB and RGB-D trackers, with an 8.9% average success rate improvement over the second best tracker. In the attribute-based comparison, CPMT performs either best or second best in all 11 attributes. Specifically, CPMT performs best in 8 of the 11 attributes, namely human, rigid, large, small, fast, occ, no-occ, and active, and second best in the other three, namely animal, slow, and passive. In particular, CPMT outperforms the second best tracker by a large margin in active (7.3%), fast (6.7%), and large (5.3%). Additionally, Table 1 shows that the average performance of the RGB-D trackers is much better than that of the RGB trackers due to the use of 3-D information. These improvements demonstrate that the CPMT tracker performs robustly over a large range of scenarios. This is because CPMT models the target’s observation with two complementary models, as humans do: the short-term memory model achieves high plasticity and the long-term memory model achieves high stability. The collaboration of the two models addresses the stability–plasticity dilemma in visual tracking, which enables CPMT to adapt to tracking challenges with both fast and drastic variations.
The average frame rate of the proposed CPMT algorithm is 5.4 frames per second (fps). Its speed is much higher than that of the second best tracker OAPF (0.9 fps; CPMT is 6× faster), the third best tracker PST (offline, with no online frame rate reported), and the fourth best tracker RGBDOcc+OF (0.1 fps; CPMT is 54× faster). This efficiency arises because the encoding and retrieval processes, which transfer information between the short-term and long-term memory, are intuitive and efficient thanks to the proposed A-S learning method.
Conclusion
In this article, we proposed a novel visual tracking algorithm (CPMT) inspired by the cognitive psychological memory mechanism. CPMT decomposes the tracking task into three components, as in humans: the sensory memory register captures 3-D information from the environment, and the 2-D appearance model and 3-D shape model are built simultaneously during tracking; the SMT models the target via the rehearsal process with high plasticity; and the LMT models the target via the encoding and retrieval processes with high stability. Extensive experimental results on a large-scale RGB-D benchmark demonstrate that the components of the biologically inspired framework collaborate with each other, and the proposed CPMT performs favorably against state-of-the-art trackers in terms of efficiency, accuracy, and robustness.
Footnotes
Authors note
Ning An and Shi-Ying Sun are also affiliated to University of Chinese Academy of Sciences, Beijing, China.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the National Natural Science Foundation of China under Grant (61271432, 61673378, 61421004).
