Abstract
Generally, face detection and tracking focus only on visual data analysis. In this paper, we propose a novel method for face tracking in camera video. By exploiting the context metadata captured by wearable sensors on human bodies at the time of video recording, we can improve the performance and efficiency of traditional face tracking algorithms. Specifically, when subjects wearing motion sensors move around in the field of view (FOV) of a camera, the motion features collected by those sensors help to locate the frames most likely to contain faces in the recorded video, thus saving a large amount of time otherwise spent filtering out faceless frames and reducing the proportion of false alarms. We conduct extensive experiments to evaluate the proposed method and achieve promising results.
1. Introduction
Locating and tracking faces in video streams are among the most fundamental techniques in computer vision. They are stepping stones for almost all facial analysis algorithms, including face alignment, face modeling, face recognition, and gender/age recognition, and have enabled numerous applications such as human-computer interaction (HCI), video surveillance, and many other multimedia applications. Particularly in the context of HCI, only when computers understand human faces well can they begin to truly figure out people's intentions and thoughts and react in a proper manner.
In general, the goal of face detection is to determine whether there are any faces present in an arbitrary image and, if so, to return the location and extent of each face. While this appears to be a simple task for human beings, it is very difficult for computers and has been a hot topic in machine vision that has attracted researchers worldwide over the past few decades. The difficulties associated with face detection stem from the many variations in lighting conditions, scale, location, orientation, pose, facial expression, occlusion, and so forth. In addition, intraclass interferences arising from make-up, beards, mustaches, and glasses on the same person make the face detection problem even harder.
In recent years, face detection has made significant progress and has been increasingly used in real-world applications and products, such as Google's Picasa. Nowadays most digital cameras are equipped with a built-in face detector to assist autofocus. However, face detection in unconstrained settings remains a challenging task. Modern face detection algorithms are mostly based on low-level feature extraction and statistical model training and focus wholly on visual data analysis. Hence, increasingly complex features and rigorous learning algorithms are being developed to extract as much information as possible from the visual content.
Automatic face tracking requires face detection to initialize the tracking process; it is an application of object tracking. In its simplest form, tracking can be defined as the problem of estimating the trajectory of an object in the image plane as it moves around a scene [1]. In terms of face tracking, a tracker assigns consistent labels to detected faces in consecutive video frames. The main challenge in tracking is clutter, which arises when features expected from the target are difficult to discriminate from features extracted from other objects in the scene. Another challenge is introduced by appearance variations of the target itself. Intrinsic appearance variability includes pose variation and shape deformation, whereas extrinsic appearance variability includes illumination change, camera motion, and different camera viewpoints [2].
In this paper, we approach the task of face tracking from a new perspective. Typical face detection and tracking are conducted frame by frame and window by window. In face detection, the time spent on filtering out a faceless frame is comparable to that spent on identifying a frame containing faces, since every search window must be checked to ensure all possible faces are detected. Faces are then tracked in subsequent frames using less computationally expensive methods, and in case of track failure, face detection runs again to reinitialize the tracker. A large amount of time is therefore wasted on searching for faces in faceless frames. For example, when a subject in the video turns his back and walks away from the camera, his face disappears entirely, and from that moment on there is no need to apply face detection and tracking. To improve performance and cut down time cost, we take advantage of context metadata collected at the time of video capture in a sensor-assisted environment to rule out potential faceless frames.
The rapid advances in consumer electronics have led to a wide proliferation of cheap, powerful wearable sensors, such as accelerometers, digital compasses, gyroscopes, and GPS receivers. These sensors, initially included in smart phones to improve user experience, are now changing the landscape of potential applications and providing reliable sources of contextual information that help model human behavior. In this study, we employ smart phones as sensing platforms to collect orientation sensor measurements and to help interpret human moving direction, which is then exploited to improve face detection and tracking. To summarize, the main contributions of this paper are twofold. First, we present a sensor-assisted fast face detection and tracking approach. As far as we know, it is the first attempt to integrate personal sensing technologies into face detection and tracking in video. This integration of a new sensing model broadens the domain of semantic analysis of visual content and will be catalyzed by the growing popularity of wearable devices and concurrent advances in ubiquitous computing. Second, we implement a set of state-of-the-art multiobject tracking algorithms and conduct extensive experiments to evaluate our method.
The remainder of this paper is organized as follows. Section 2 presents the related work. Section 3 introduces the problem we address. Section 4 details the proposed method. Section 5 describes our experiments together with result analysis. Concluding remarks are placed in Section 6.
2. Related Work
2.1. Face Detection
There have been hundreds of reported approaches to the problem of face detection. Based on the early work of Yang et al. [3], existing face detection approaches can be grouped into four categories: knowledge-based methods, feature invariant approaches, template matching methods, and appearance-based methods. Knowledge-based methods employ predefined rules to determine face presence based on human knowledge; feature invariant approaches aim to find face structure features that are robust to pose and lighting variations; template matching methods make use of prestored face templates to judge whether a face exists in an image; appearance-based methods rely on techniques from statistical analysis and machine learning to find the relevant characteristics of face and nonface images. The learned characteristics take the form of distribution models or discriminant functions that are subsequently used for face detection. Meanwhile, dimensionality reduction is usually carried out for the sake of computational efficiency and detection efficacy. Among these approaches, appearance-based methods have distinguished themselves as the most promising and have shown performance superior to the others.
There are mainly two important factors that determine the success of a face detector: the features used for representing face images and the learning algorithm that implements the detection. Histogram based features have become very popular in recent years due to their excellent performance and efficiency, including local binary patterns [4], local ternary patterns [5], and histograms of oriented gradients [6]. Most state-of-the-art face detection methods usually use a combination of these features by concatenating them or by optimizing combination coefficients at the learning stage.
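As an illustration of such histogram features, a basic local binary pattern histogram can be computed as follows. This is a minimal Python sketch for a grayscale patch stored as nested lists; it omits the uniform-pattern and multiscale refinements used in practice:

```python
def lbp_histogram(img):
    """256-bin local binary pattern histogram of a 2-D grayscale
    image given as a list of lists of pixel intensities."""
    hist = [0] * 256
    rows, cols = len(img), len(img[0])
    # 8-neighbour offsets, clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            centre = img[r][c]
            code = 0
            # set a bit for every neighbour at least as bright as the centre
            for bit, (dr, dc) in enumerate(offsets):
                if img[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            hist[code] += 1
    return hist
```

Concatenating such histograms over a grid of cells yields the kind of feature vector commonly fed to the learning stage.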
In terms of learning, most approaches treat face detection as a binary classification problem and determine whether the current search window contains a face. Various machine learning methods, ranging from the nearest neighbor classifier to more complex approaches such as neural networks, convolutional neural networks, and classification trees, have been employed for face detection. Among them, boosting-based cascades have attracted a lot of research interest. Viola and Jones [7] introduced a very efficient face detector by using AdaBoost to train a cascade of pattern-rejection classifiers over rectangular wavelet features. Each stage of the cascade is designed to reject a considerable fraction of the negative cases that survive to that stage, so most of the windows that do not contain faces are rejected early in the cascade with comparatively little computation. As the cascade progresses, rejection typically gets harder, so the single-stage classifiers grow in complexity. The structure of the cascaded detection process is essentially that of a degenerate decision tree. In our work, we employ this detector to initialize face tracking.
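The early-rejection logic of such a cascade can be sketched as follows. This is illustrative Python; the `score_fn` callables and thresholds stand in for the actual boosted Haar-feature stage classifiers:

```python
def cascade_classify(window, stages):
    """Run a search window through a rejection cascade.

    `stages` is a list of (score_fn, threshold) pairs ordered from
    cheap to expensive.  The window is rejected at the first stage
    whose score falls below its threshold, so most negatives are
    discarded with very little computation.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False   # rejected early, later stages never run
    return True            # survived every stage: face candidate
```

A window here is any feature vector; in the Viola-Jones detector each stage score is a weighted sum of weak classifiers over rectangular features.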
2.2. Object Tracking
Two tracking paradigms have been presented in [8]. Recursive tracking methods estimate current state
The research efforts mentioned above focus wholly on the analysis of visual data. In this study, by contrast, we provide a novel method that exploits contextual information collected at the time of video recording to aid face detection and tracking in video.
3. Problem Formulation
Subjects carrying smart phones move around casually in the FOV of a fixed digital camera. Video data are continuously recorded, and motion measurements are collected by the sensors embedded in the phones. As depicted in Figure 1, the direction measurements captured by the orientation sensor within the red brace indicate that the subject is moving toward the camera, and during this period the camera can most likely capture clear faces. Based on this judgment, we can apply face detection and tracking directly to the frames recorded in this period and skip the faceless frames before and after. Our objective is to identify these advantageous situations and improve face detection and tracking in various situations with the help of on-body sensors.

An application scenario of the proposed method. We believe that the time period marked by the red brace, during which the subject moves towards the camera, is probably the most suitable for face detection and tracking.
4. Proposed Method
In this section, we elucidate the proposed method in detail. As illustrated in Figure 2,

A subject moving in the FOV of a camera.
Based on the above analysis, we propose a two-stage automatic face tracking framework. In the first stage, the context metadata collected at the time of video capture are scanned to identify advantageous situations for visual analysis, and video frames are automatically labeled to indicate whether they contain faces. In the second stage, face detection and tracking are conducted over the labeled frames to locate and track all faces.
4.1. Sensor Description
Two types of sensor are involved in the proposed method: a camera and an orientation sensor. Video streams recorded by the camera are saved on disk as discrete files. In this subsection, we focus on collecting orientation measurements from the orientation sensor on smart phones.
Currently most smart phones are equipped with various types of specialized sensors originally aimed at improving user experience, including the orientation sensor. An orientation sensor usually consists of an accelerometer and a magnetometer and can sense the orientation of a smart phone relative to the earth with three values: pitch, roll, and azimuth, as shown in Figure 3. Pitch indicates rotation about the x-axis and ranges from −180° to 180° inclusive, with positive values when the z-axis tilts toward the y-axis; roll indicates rotation about the y-axis and ranges from −90° to 90° inclusive, with positive values when the z-axis tilts toward the x-axis. Azimuth indicates the angle between the y-axis and the magnetic north direction and ranges from 0° to 359° inclusive. Experiments have demonstrated that with the specified phone attachment shown in Figure 3, the azimuth angle of the smart phone can be utilized to estimate the moving direction of the human body.
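As a sketch of how the azimuth angle can indicate movement toward the camera, the following Python fragment compares an azimuth reading against the heading opposite the camera's optical axis. The 45° tolerance here is a hypothetical value chosen for illustration, not the threshold derived experimentally in Section 4.2:

```python
def angular_diff(a, b):
    """Smallest absolute difference between two azimuths in degrees,
    handling the wrap-around at 0/360."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def moving_toward_camera(azimuth, camera_heading, threshold_deg=45.0):
    """A subject walks toward the camera when his body heading is
    roughly opposite the camera's optical axis.  threshold_deg is a
    hypothetical tolerance, not the paper's learned threshold."""
    facing_camera = (camera_heading + 180) % 360
    return angular_diff(azimuth, facing_camera) <= threshold_deg
```

The wrap-around handling matters because azimuth values near 0° and near 359° describe nearly the same heading.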

(a) A Logitech C615 digital camera fixed to a tripod. (b) An HTC G3 Android smart phone with a built-in 3-axis accelerometer and a geomagnetic field sensor. The phone is attached to the waist belt of a subject. (c) Reference frame of a smart phone, with front view to the page.
4.2. Frame Labeling
We conduct a preliminary experiment to quantitatively measure impacts of moving direction on face detection. First, we divide

Sector layout in half of camera FOV. Each sector covers an angle of 15°.
We manually count the number of faces in every video clip and then perform face detection using a Haar feature based face detector [7] in OpenCV [16]. Statistics about collected data and detection results are listed in Table 1, where
Basic information about video clips collected from each sector.
From the results we can conclude that faces, whether manually labeled (AvgFa) or detector-predicted (AvgFaD), start to decrease dramatically from sector
With the obtained threshold, we design a frame labeling algorithm, as listed in Algorithm 1. All collected orientation measurements are first smoothed to reduce noise and then scanned to label the video frames recorded in that time period, in preparation for face detection and tracking. Once a qualified azimuth sample is detected within a time window, all frames within that window are labeled positive. This strategy alleviates the adverse impact of sudden body turns. With respect to situations where multiple subjects exist, to simplify the problem we assume that all subjects stay within the camera FOV in all experiments. The final result of frame labeling is calculated by applying logical OR to the set of
Algorithm 1: Frame labeling. w: time window in seconds. b: a boolean variable indicating whether frames within a time window should be labeled positive.
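A minimal Python sketch of this labeling strategy follows. It assumes, for simplicity, one azimuth sample per video frame (the actual algorithm works on time windows of w seconds), and the `qualifies` predicate stands in for the azimuth threshold test:

```python
def label_frames(azimuths, frames_per_window, qualifies):
    """If any azimuth sample in a window qualifies, every frame in
    that window is labelled positive, which tolerates brief body
    turns inside the window."""
    labels = [False] * len(azimuths)
    for start in range(0, len(azimuths), frames_per_window):
        window = azimuths[start:start + frames_per_window]
        if any(qualifies(a) for a in window):
            for i in range(start, start + len(window)):
                labels[i] = True
    return labels

def combine_subjects(label_sets):
    """Logical OR across per-subject label sequences: a frame is kept
    if any subject's sensor data marks it positive."""
    return [any(frame_labels) for frame_labels in zip(*label_sets)]
```

In a multi-subject recording, `label_frames` is run once per subject's sensor log and the results are merged with `combine_subjects`.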
4.3. Face Tracking
In this section, we present three sensor-assisted tracking algorithms to track faces in the labeled frame sequences: track by detection, track by mean shift, and track by TLD (tracking-learning-detection). In these algorithms, we employ the Viola-Jones face detector [7] to initialize the tracking process. Moreover, to reduce detection time, we filter out nonskin areas from each frame using the skin model presented in [17]. To deal with multiface tracking, we design a face classification algorithm to group different faces, as shown in Algorithm 2. The logic of the algorithm is straightforward: for each face, we search for the most similar descriptor by comparing the normalized face patches. In this paper we resize each face to a
Algorithm 2: Face classification. A new descriptor is created when a face matches no existing descriptor.
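The nearest-descriptor matching underlying Algorithm 2 can be sketched as follows. This is illustrative Python over flattened, normalized face patches; the distance metric and the 0.2 threshold are hypothetical choices, not the paper's exact parameters:

```python
def patch_distance(a, b):
    """Mean absolute difference between two equal-size normalised
    face patches, each flattened to a list of floats in [0, 1]."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def classify_face(patch, descriptors, max_dist=0.2):
    """Assign a face patch to the nearest stored descriptor, or
    create a new identity when no descriptor is close enough.
    Returns the identity index; mutates `descriptors` in place."""
    best_idx, best_dist = None, max_dist
    for idx, desc in enumerate(descriptors):
        dist = patch_distance(patch, desc)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    if best_idx is None:
        descriptors.append(patch)       # unseen face: new identity
        return len(descriptors) - 1
    return best_idx
```

Grouping faces this way lets each tracked identity keep a consistent label across frames even when several faces are present.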
(1) Track by Detection. As shown in Algorithm 3, we apply face detection over each positively labeled frame. The detected faces are then classified using Algorithm 2. The performance of this algorithm relies entirely on the generalizability and representativeness of the detector's training samples. We use this algorithm as a benchmark for parameter optimization in Section 5.1.
Algorithm 3: Track by detection. Faces are detected from each positively labeled frame.
(2) Track by Mean Shift. We provide a tracking algorithm based on mean shift [10] in Algorithm 4. Mean shift is a procedure for locating the maxima of a density function given discrete data sampled from that function. In
Algorithm 4: Track by mean shift. int: the interval of track reinitialization.
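The mean shift procedure itself, locating a maximum of the sample density, can be illustrated in one dimension. This is a minimal sketch with a flat kernel; the tracker applies the same idea to color-histogram back-projections over image windows:

```python
def mean_shift_1d(points, start, bandwidth=1.0, tol=1e-3, max_iter=100):
    """Iteratively move an estimate to the mean of the samples within
    `bandwidth` of it, converging to a local maximum of the sample
    density (flat kernel)."""
    x = start
    for _ in range(max_iter):
        neighbours = [p for p in points if abs(p - x) <= bandwidth]
        if not neighbours:
            break                      # no support: stop where we are
        mean = sum(neighbours) / len(neighbours)
        if abs(mean - x) < tol:
            break                      # converged to a density mode
        x = mean
    return x
```

In the face tracker the "points" are pixel weights inside the search window, and the window center is shifted until it settles on the mode of the face's color distribution.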
(3) Track by TLD. TLD was proposed by Kalal et al. [18, 19]. It is a framework designed for long-term tracking of an unknown object in unconstrained environments. The object is tracked and simultaneously learned in order to build a detector that supports the tracker once it fails. The detector is built upon the information from the first frame as well as the information provided by the tracker. The original TLD tracker tracks only one object. We create a multiobject tracker based on OpenTLD provided in [20]. In Algorithm 5, we first initialize TLD tracker in
Algorithm 5: Track by TLD.
5. Experiments
In this section, we conduct extensive experiments to evaluate the proposed method. In addition to the video devices and capture settings used in Section 4.2, we also utilize Android smart phones equipped with orientation sensors. Two subjects are recruited to take part in our experiments, and the phones are attached to their waist belts, where the moving direction of a subject can be best approximated, as shown in Figure 3. A simple GUI application is created to start and stop data collection on the phones. Orientation measurements are recorded and saved in text files on the phone SD card and later accessed via USB. We implement the labeling and tracking algorithms proposed in Section 4 on top of the OpenCV library [16]. Experiments are performed in four situations: indoor single-face, indoor multiple-face, outdoor single-face, and outdoor multiple-face. Subjects move randomly within the part of the camera FOV where their faces can be distinguished by the naked eye. In each situation, we repeat the experiment four times, each run lasting about five minutes. In total, we collect sixteen video clips and thirty-two text files of orientation measurements.
5.1. Tracking Optimization
To label frames using Algorithm 1, we set

Performance of frame labeling with different time windows.
The threshold
As illustrated in Figure 6, Algorithm 4 achieves its best result in both indoor and outdoor environments when using an interval of

Averaged F-score obtained from mean shift tracking with different intervals. The first interval is 1 frame.
5.2. Tracking Comparison
In this subsection, we compare the sensor-assisted face tracking algorithms described in Algorithms 3, 4, and 5 with their sensorless counterparts in terms of performance and processing speed. To conduct sensorless face tracking, we simply set the elements of

Precision and recall of sensorless and sensor-assisted face tracking algorithms in each situation.

Negatively labeled frames that contain false positive results in different situations.

Processing speed of sensorless and sensor-assisted tracking algorithms in each situation.

Extracted frames of tracking results, where (a1)–(f1) are from track by detection, (a2)–(f2) are from track by mean shift, and (a3)–(f3) are from track by TLD.
6. Conclusion
In this paper, we propose a novel method for fast face tracking. The method innovatively leverages sensor-captured contextual information and can serve as a preliminary step to assist various algorithms for face detection and tracking in video. Experimental results demonstrate the performance improvement brought by the proposed method. However, the method is limited in the following aspects. First, users have to register and carry their smart phones in order to facilitate the tracking process. This necessary attachment of sensors undermines the unobtrusiveness of visual sensing, causes inconvenience to users, and limits the application of the method to specific groups of people in restricted places where their healthcare and security are of concern, such as inpatients in hospitals and elders in nursing homes. Second, frames might be mislabeled on some occasions, which may harm the performance of the method. For example, when a subject turns his head toward attractions around the camera while his body faces away from it, the video frames recorded during this period are labeled negative by the proposed method and his face may be missed. Conversely, when a subject moves out of the camera FOV while still heading opposite to the camera direction, the frames recorded at that moment are labeled positive even though the subject is not in them, and the tracking analysis over these frames is wasted. Third, the proposed method does not apply to video archives created in the past, owing to the absence of contextual metadata. Much work remains to improve the method. In the future, we plan to explore the possibility of applying other wearable sensors to the content analysis of visual data.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is partially supported by the Project on the Architecture, Key Technology Research and Demonstration of Web-Based Wireless Ubiquitous Business Environment (no. 2012ZX03005008-001), the National Natural Science Foundation of China (Grant nos. 61202436 and 61271041), Natural Science Foundation of Jiangsu Province, China (Grant no. BK20130164), and EU funded iCore project, “Internet Connected Objects for Reconfigurable Ecosystems.”
