Abstract
BACKGROUND:
Video-based face recognition (VFR) is one of the frontier topics in computer vision; it aims to automatically track and recognize facial regions of interest (ROIs) in video sequences.
OBJECTIVE:
In videos with multiple faces, the trajectories of individuals are highly complex. This setting has been studied far less than videos with a single face per frame.
METHODS:
In this paper, we present a multi-trajectory incremental learning (MTIL) algorithm, which categorizes trajectories using a Euclidean distance-based greedy algorithm and estimates the most likely label for each trajectory by incremental learning, thereby correcting the classification and improving recognition accuracy. Furthermore, this study proposes an enhanced detection method that combines face detection with a robust tracking-learning-detection (TLD) algorithm to improve face detection performance in video. The method can also be extended to medical video recognition applications such as gesture-recognition-based medical control systems.
RESULTS:
Experiments on the Honda/UCSD and BMP (seq_mb) databases demonstrate that our method improves both face detection and face recognition (single- or multi-face) performance. The method also performs well in the gesture recognition system.
CONCLUSION:
The proposed MTIL algorithm can significantly improve the performance of the VFR system and the gesture recognition system.
Introduction
Video-based face recognition (VFR) is a comprehensive research field that includes face detection, target tracking, and face recognition, and it has been widely studied. Although less complicated than the VFR problem, gesture recognition is also important in several realistic applications such as medical video recognition systems. Generally speaking, face recognition and gesture recognition can be combined into one research topic. VFR can be divided into recognition based on video sequences and on image sets, where the former utilizes the dynamic spatiotemporal information in the sequences [1]. Considerable progress has been made by VFR researchers, including probabilistic approaches [2], adaptive learning [3], hidden Markov models [4], and the Radon transform [23]. An adaptive multi-classifier system (AMCS) for video-to-video FR in changing surveillance environments was presented by Pagano [5]. Torre et al. [6] developed a VFR method based on adaptive skew sensitivity; it improves the accuracy and robustness of classifier ensembles by selecting training data with varying levels of imbalance and complexity. They also proposed a method for partially supervised learning from facial trajectories [7]. Dewan et al. [8] developed an adaptive appearance model tracker (AAMT) that attempts to solve the 'single sample per person' (SSPP) problem by creating a track-face-model for each person, which is updated on each frame and matched against each person's gallery-face-model recorded in the system.
The selection of non-targets is a difficult problem because the human face is a complex non-rigid object that is easily influenced by pose, lighting, expression, and appearance changes [9]. In videos with a single human face, the FR system only needs to detect or track one face region per frame. In contrast, in videos with multiple faces, the trajectories of individuals are highly complex and appear simultaneously. In this study, we tested a multi-valued classifier algorithm based on the local binary patterns histogram (LBPH). Results indicate that the proposed multi-trajectory incremental learning (MTIL) algorithm can use general multi-valued classifier-based FR algorithms to match the multiple face trajectories in a video to labels. The most probable label for each trajectory can be estimated and updated, which gradually improves the accuracy of the recognition results.
Human faces must be detected before they can be recognized. Current approaches to face detection include those based on machine learning [10], average face templates [11], or head-shoulder detectors [12]. One recent popular approach bases face detection on the Viola & Jones (V&J) classifier [13]. However, this classifier is known to produce false negatives and false positives in tests under changes in lighting or pose (especially right-profile orientations), which may be explained by insufficiently accurate training results [5].
The contributions of this paper are as follows. First, it presents the MTIL algorithm, which can recognize multiple faces that appear simultaneously in a scene. A Euclidean distance-based greedy algorithm is used to categorize the trajectories, and each trajectory's classifications are accumulated by a multi-value classifier in statistics tables. Second, the accuracy and reliability of face detection are enhanced by a novel combination of change detection, V&J face detection, and a robust TLD algorithm.
Overview of video-based multi-face tracking and recognition system and change detection
Figure 1 depicts the general framework of the face recognition system based on enhanced detection and MTIL. The system consists of face detection, tracking, face recognition, and trajectory incremental learning modules. The face detection and tracking systems are connected by the change detection module, which is responsible for determining whether the number of faces has changed and for detecting false negatives and positives. The tracking system constantly makes adjustments based on information from the change detectors. The features are extracted using an LBPH operator, which has the advantage of being invariant to rotation and grayscale transformation.
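As a concrete illustration of the LBPH features mentioned above, the following is a minimal pure-Python sketch of the basic 3x3 LBP operator and its histogram. The function names and the nested-list image format are illustrative assumptions; a real system would compute such histograms per block of the face region and concatenate them.

```python
def lbp_code(img, r, c):
    """Basic 3x3 LBP: threshold the 8 neighbours against the centre pixel
    and pack the comparison bits (clockwise from top-left) into one byte."""
    center = img[r][c]
    # Neighbour offsets, clockwise starting at the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """Histogram of LBP codes over all interior pixels (256 bins)."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist
```

A uniform patch yields code 255 (every neighbour is greater than or equal to the centre), which hints at why the descriptor is robust to monotonic grayscale changes.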
Architecture of the proposed video-based multi-face recognition system using TLD tracking and trajectory improvement learning.
The symbols in Fig. 1 (CPL, Tail, and TCC) denote the three tables used by the MTIL algorithm; they are described in detail in the MTIL section below.
Choosing the tracking algorithm
Current tracking methods can be categorized as based on regions, dynamic contours, features, or models [16, 17, 18, 19, 20, 21]. A key problem in long-term tracking is variation of the target, such as occlusion, pose, scale, and lighting changes. It is difficult to ensure the continuity and accuracy of tracking when the target is occluded or undergoes other local changes from time to time.
The tracking-learning-detection (TLD) algorithm is a long-term tracking algorithm by Kalal et al. [22]. It is extremely robust against shape changes, partial occlusions, and other changes to the target: an online learning mechanism continuously updates the tracking module's 'significant feature points' and the detection module's target models and relevant parameters. In this study, we used the TLD algorithm to improve our system's performance.
Face detection and change detection mechanism
The popular Viola & Jones (V&J) classifier is used for face detection in the initial frames of the video. When applied to VFR, a V&J-based face detector can suffer from errors caused by lighting, pose, or expression. For example, rotation, skewing, or intense expressions may cause the detector to lose a face (false negatives, FNs), while inaccuracies in its training may cause it to identify non-face regions as faces from time to time (false positives, FPs). The change detection module detects real changes in the number of human faces in the scene and corrects the tracker's record of the number and status of faces by filtering out the abnormal decreases and increases caused by FNs and FPs, respectively. Given that false detections last only a short time, we can tally how long a detected change persists and discard changes that vanish after a few frames.
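The duration-tallying idea can be sketched as a simple debounce filter: a change in the detected face count is accepted only after it persists for several consecutive frames, while shorter blips are treated as likely FNs/FPs. The class name and the `min_frames` threshold below are illustrative assumptions, not the paper's exact rule.

```python
class ChangeDetector:
    """Accept a change in the face count only after it persists for
    `min_frames` consecutive frames; shorter blips are ignored as
    likely false negatives/positives (illustrative sketch)."""

    def __init__(self, min_frames=5):
        self.min_frames = min_frames
        self.stable_count = None   # last accepted face count
        self.candidate = None      # tentative new count
        self.run_length = 0        # frames the candidate has persisted

    def update(self, detected_count):
        if self.stable_count is None:            # first frame
            self.stable_count = detected_count
        elif detected_count == self.stable_count:
            self.candidate, self.run_length = None, 0
        elif detected_count == self.candidate:
            self.run_length += 1
            if self.run_length >= self.min_frames:
                self.stable_count = detected_count
                self.candidate, self.run_length = None, 0
        else:                                    # new tentative count
            self.candidate, self.run_length = detected_count, 1
        return self.stable_count
```

With this design, a one-frame V&J miss does not disturb the tracker, while a genuine entry or exit of a face is confirmed after a short delay.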
MTIL-based face recognition
Multi-trajectory incremental learning (MTIL) algorithm
Recently, good local feature descriptors, such as local binary patterns (LBP) [14] and the scale-invariant feature transform (SIFT) [15], have been widely used in face recognition. We chose the local binary patterns histogram (LBPH) to represent facial features for its moderate computational complexity. Our multi-trajectory incremental learning (MTIL) algorithm tracks multiple face trajectories: it uses a Euclidean distance-based greedy algorithm to categorize the trajectories, establishes a multi-value classification table for each trajectory, and determines the final result by the majority-voting rule. When a face region is detected, the system records its coordinates, finds the closest trajectory in the trajectory tail table in terms of Euclidean distance, and selects the class label with the majority vote in the trajectory statistics table as the final classification for the region. The result thus stabilizes progressively.
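The Euclidean distance-based greedy matching step might be sketched as follows; the function names, the `max_dist` gating threshold, and the use of face-centre points are assumptions made for illustration.

```python
import math

def greedy_match(detections, tails, max_dist=80.0):
    """Greedily assign each detected face centre (x, y) to the nearest
    unmatched trajectory tail; pairs farther apart than `max_dist`
    are left unmatched, i.e. they start new trajectories.
    Returns {detection_index: tail_index or None} (illustrative sketch)."""
    # All candidate pairs, sorted by Euclidean distance (greedy order).
    pairs = sorted(
        ((math.dist(d, t), di, ti)
         for di, d in enumerate(detections)
         for ti, t in enumerate(tails)),
        key=lambda p: p[0])
    assigned, used_tails = {}, set()
    for dist, di, ti in pairs:
        if di in assigned or ti in used_tails or dist > max_dist:
            continue
        assigned[di] = ti
        used_tails.add(ti)
    for di in range(len(detections)):
        assigned.setdefault(di, None)   # unmatched -> new trajectory
    return assigned
```

Each matched tail is then updated with the new face coordinates, so the next frame's matching again compares against the most recent positions.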
We use three tables: CPL, Tail, and TCC. CPL is regenerated in each frame of the video and contains the current information of the faces captured on the screen, whereas Tail and TCC are created in the first frame and persist until the end of the video. Once CPL has been created for a frame, its information is used to update Tail and TCC. Then, the information in CPL itself is revised with the aid of Tail and TCC. The final output is the revised information of CPL.
The combined classification process involves the following tasks:
For each frame:
1. For each individual, update the trajectory statistics table TCC according to Tail. If TCC is null, it is created with data from Tail, with each trajectory's vote count for every class initialized to 0; if TCC is not null, one vote is added to the corresponding class statistics.
2. For each individual, select the class label with the majority vote in TCC as the final classification.
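The steps above can be sketched with TCC kept as a nested dictionary of per-trajectory vote counts; the table layout and function names are illustrative assumptions.

```python
def update_tcc(tcc, tail):
    """Incrementally update the trajectory class-statistics table TCC.
    `tail` maps trajectory id -> class label estimated on this frame.
    Missing trajectories/labels are created with counts starting at 0."""
    for traj_id, label in tail.items():
        votes = tcc.setdefault(traj_id, {})
        votes[label] = votes.get(label, 0) + 1

def majority_label(tcc, traj_id):
    """Final class for a trajectory: the label with the most votes."""
    votes = tcc.get(traj_id, {})
    return max(votes, key=votes.get) if votes else None
```

Because votes accumulate across frames, an occasional per-frame misclassification is outvoted by the trajectory's history, which is the incremental-correction effect the algorithm relies on.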
We found that the MTIL method often cannot correctly identify overlapping face regions: when the coordinates of the front face and back face overlapped, the trajectory could be classified as belonging to the back face. To address this problem, we added a balancing rule: when face regions overlap, only the initial class estimations from the classifier model are used as the result. Tests indicate that this strategy significantly reduces these false results.
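The balancing rule might look like the following sketch, where an axis-aligned overlap test decides whether to fall back to the classifier's per-frame estimate; the (x, y, w, h) box format and function names are assumptions.

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test for boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def resolve_label(box, other_boxes, classifier_label, trajectory_label):
    """Balancing rule (sketch): if this face box overlaps any other face
    box in the frame, trust only the classifier's initial per-frame
    estimate; otherwise use the trajectory's majority-vote label."""
    if any(boxes_overlap(box, o) for o in other_boxes):
        return classifier_label
    return trajectory_label
```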
Experimental results
The tests were conducted using the Honda/UCSD Video Database for face tracking and the BMP Image Sequences for Elliptical Head Tracking. The Viola & Jones algorithm was used for per-frame face detection. Robust face tracking was achieved by the TLD multi-target tracking method combined with a change-detection strategy. The LBPH-trained face classification model used our proposed MTIL algorithm to progressively correct the results of face detection, tracking, and the classifier's preliminary class estimates.
Video-based face detection based on the V&J+TLD algorithm
Figures 2 and 3 depict some results of face detection using only V&J and using V&J+TLD, respectively.
Table 1 compares the detection rates and false positive rates of the two methods on the Honda/UCSD database: the V&J+TLD method achieves higher detection rates and lower FP rates than V&J alone.
Face detection rates (%) and FP rates (%) for VFR on single targets from Honda/UCSD
Face detection with V&J on Honda/UCSD: a. true positives; b. false negatives; c. false positives.
Face detection with V&J+TLD on Honda/UCSD.
Another test video is the seq_mb file from the BMP Image Sequences for Elliptical Head Tracking database. This video is characterized by low resolution and drastic head movements (360-degree head rotation and horizontal skewing). The former factor may lead to frequent FPs with V&J, e.g. Fig. 4a, while the latter leads to FNs, e.g. Fig. 4c. The use of the V&J+TLD method reduces both types of error.
Single-target face detection and tracking using V&J and V&J+TLD.
Face detection with V&J and V&J+TLD.
Table 2 lists the test results of single-target detection and tracking on seq_mb, with the precision and recall rates calculated by the standard equations: precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP, and FN are the numbers of true-positive, false-positive, and false-negative frames.
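These two rates follow directly from the tallied frame counts; the helper below is an illustrative sketch, not the authors' code.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN), computed from
    frame counts tallied over the whole sequence. Empty denominators
    are mapped to 0.0 to avoid division by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```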
Single-trajectory video-based face recognition
The full VFR tests both use V&J+TLD for face detection and tracking.
Comparison of correct recognition rates (%) and false recognition rates (%) between LBPH and LBPH+MTIL on single targets from Honda/UCSD
Variations of precision rates and recall rates for LBPH and LBPH+MTIL.
Table 3 compares the correct recognition rates (frames of correct recognition / total frames) and false recognition rates (false-positive frames / frames with detected faces) between LBPH and LBPH+MTIL.
The video for multi-trajectory VFR tests is taken from the second half of seq_mb from the BMP Image Sequences. The segment provides the complexity factor for multi-trajectory recognition because it contains two individuals who obscured each other during the video, one of which had first left and then reentered the scene. Figure 5 compares the algorithms’ effect on precision rates and recall rates. Table 4 compares the final data.
Results of multi-ROI recognition with LBPH and LBPH+MTIL on seq_mb. (Unit: frame)
Experiments show that, compared with the original algorithm, this method improves recognition accuracy: the precision and recall rates of LBPH+MTIL exceed those of LBPH alone.
Hand detection using the skin-color algorithm.
a. hand gestures used in our experiment; b. gesture samples extracted from the raw image.
The proposed method can also be used in medical systems such as a gesture-recognition-based touchless visualization system for medical volume data [24]. Because the V&J face detection algorithm cannot be applied to gesture recognition, we instead used a skin-color detection algorithm in HSV color space for hand detection, as shown in Fig. 6. We first use skin-color detection to find the approximate hand area, and then apply binarization to eliminate redundant parts such as clothing.
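The skin-color detection step might be sketched as a per-pixel HSV threshold followed by binarization into a 0/1 mask; the threshold values below are illustrative assumptions, since the paper does not give its exact ranges.

```python
import colorsys

def is_skin(r, g, b):
    """Classify one RGB pixel (components 0-255) as skin by thresholding
    in HSV space. Threshold values are illustrative assumptions."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    # Skin tones cluster near low (reddish) hue with moderate
    # saturation and sufficient brightness.
    hue_ok = h <= 50 / 360.0 or h >= 340 / 360.0
    return hue_ok and 0.15 <= s <= 0.75 and v >= 0.35

def skin_mask(pixels):
    """Binarize an image (nested list of (r, g, b) tuples) into a 0/1
    mask, as a stand-in for the binarization step in the text."""
    return [[1 if is_skin(*px) else 0 for px in row] for row in pixels]
```

In practice the binary mask would then be cleaned with morphological operations before the hand region is cropped for LBPH feature extraction.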
Following the experimental setting of [24], we adopt seven gestures, as shown in Fig. 7a: Finger up, Finger down, Finger left, Finger right, Palm up, Palm down, and Grasp. Some samples extracted from the raw images are shown in Fig. 7b.
We collected 897 gesture samples and split them into two halves: a training set and a test set. The gesture recognition experiment was conducted using LBPH+MTIL.
Gesture recognition result
Conclusion
Video-based face recognition is a challenging problem that combines tracking, detection, and recognition. Gesture recognition is a similar problem and can be used in touchless visualization systems for medical applications. The Viola & Jones algorithm has been widely used in VFR, but systems based on V&J are known to produce false negatives and false positives in tests. The accuracy and reliability of face detection can be improved by combining TLD with change detection based on video continuity. Tests have shown that our approach can recognize multiple targets in videos while improving recognition precision over time. The TLD algorithm combined with a change-detection strategy significantly improved the accuracy and robustness of face detection. Tests on a video from BMP show that the V&J+TLD method outperforms V&J alone.
The accuracy of FR can be progressively improved over time, because spatiotemporal information from the video enables us to correct the classification results. The establishment and classification of face trajectories are particularly difficult when more than one face appears in the scene. Our proposed multi-trajectory incremental learning algorithm can track and recognize multiple faces in a video by using a Euclidean distance-based greedy algorithm to classify the trajectories, storing each trajectory's data in multi-value statistics tables, and basing the final results on the majority-voting rule. Tests on videos from Honda/UCSD show that LBPH+MTIL achieves higher recognition accuracy than LBPH alone.
For the gesture recognition experiment, skin-color detection and binarization are used to detect hand samples, and then LBPH+MTIL is applied to recognize the gestures.
Footnotes
Conflict of interest
None to report.
