Learning Discriminative Transferable Sparse Coding for Cross-View Action Recognition in Wireless Sensor Networks

Abstract

Human action recognition in wireless sensor networks (WSN) is an attractive research direction because of its wide range of applications. However, human actions captured by different sensor nodes in a WSN are observed from different views, and the performance of a classifier trained on one view tends to degrade sharply on another. In this paper, we focus on cross-view action recognition in WSN and propose a novel algorithm named discriminative transferable sparse coding (DTSC) to overcome this drawback. We learn the sparse representation with an explicit discriminative goal, making the proposed method suitable for recognition. Furthermore, we simultaneously learn the dictionaries from different sensor nodes such that the same actions from different sensor nodes have similar sparse representations. Our method is verified on the IXMAS dataset, and the experimental results demonstrate that it achieves better results than previous methods on cross-view action recognition in WSN.

1. Introduction

Recent advances in wireless communications and electronics have encouraged the advent of massively distributed wireless sensor networks (WSN) [1, 2]. A WSN consists of a large number of small, low-cost, low-power sensor nodes, which collect and disseminate environmental data [3]. WSN has a wide range of applications, such as surveillance systems, guiding systems, biological detection, and habitat, agriculture, and health monitoring [4, 5]. Video surveillance in WSN is an attractive direction that has led to a large body of research [6, 7]. In this case, each sensor node is a surveillance camera. Human action recognition is a key technique for video surveillance in WSN and has been widely studied over the past several years. A large number of approaches have been proposed to make action representation more discriminative, such as space-time pattern templates [8], 2D shape matching [9, 10], trajectory-based representation [11], optical flow patterns [12], spatiotemporal interest points [13, 14], and attribute-based methods [15]. Within the bag-of-words (BOW) framework, methods that extract features using spatiotemporal interest points have shown promising performance. These methods are relatively robust to illumination variation, background changes, and noise, because they do not rely on preprocessing techniques such as trajectory tracking or motion detection. Moreover, another family of methods [16–18] inspired by this model exploits spatial and temporal context as an additional feature for describing interest points. These approaches are effective for recognizing actions observed from similar views, but when human actions are captured by different sensor nodes in WSN from different viewpoints, their performance tends to degrade sharply, because the same action looks very different when observed from different sensor nodes (views). Therefore, action models learned from labeled samples in one sensor node are less discriminative for recognizing actions in a different sensor node. An intuitive remedy is to train a separate classifier for each sensor node, but a WSN contains so many sensor nodes that labeled samples are lacking for most of them.

A large number of approaches have been proposed to address action recognition across different sensor nodes in WSN (also called cross-view action recognition). Some of these approaches build on existing techniques, such as geometric constraints [8], body joint detection and tracking [19, 20], and 3D models [21, 22]. For instance, Rao et al. [20] presented a view-invariant representation of human action that captures dramatic changes in the speed and direction of the trajectory using the spatiotemporal curvature of the 2D trajectory. Nevertheless, these approaches rely either on body joint detection and tracking or on alignment between views, which limits their practical use. Junejo et al. [23, 24] explored a self-similarity matrix of action sequences, which is highly stable under view changes. Chen and Grauman [25] proposed a 3D appearance tensor indexed by pose examples, viewpoints, and image positions, from which examples of unseen views can be inferred.

Recently, transfer learning approaches have been employed to address cross-view action recognition. Farhadi and Tabrizi [28] generated split-based features in the source view using maximum margin clustering and then transferred the split values to the corresponding frames in the target view. Liu et al. [26] learned a cross-view bag of bilingual words using corresponding pairs, so that the action videos in both views are represented by bilingual words. Li and Zickler [27] assumed that there is a virtual path connecting the source view with the target view, where each point on this path is defined as a virtual view. Several virtual views are then sampled to form a single long feature that is robust to view variations. Zhang et al. [29] sampled all the virtual views on the virtual path and integrated them into an infinite-dimensional feature. Correspondingly, a virtual view kernel was proposed to measure the similarity between two infinite-dimensional features.

In this paper, we propose a novel approach for cross-view action recognition in WSN by learning discriminative transferable sparse coding (DTSC). We consider labeled actions observed simultaneously in both the source and target views (corresponding pairs), where each sensor node corresponds to one view. The goal of our approach is to train a model in the source view with the help of a small number of corresponding pairs and to test it in the target view. To make DTSC suitable for recognition, the sparse representation is learned with an explicit discriminative goal. Concretely, the discriminative power of the sparse coefficients depends on two requirements. First, the sparse coefficients should represent the actions well using the corresponding subdictionary. Second, the product of the sparse coefficient and a subdictionary from a different class is expected to be close to zero. To this end, we add a constraint on both the sparse coefficients and the subdictionaries.

In the training process, we first construct a separate dictionary for each view using the k-means algorithm and represent action videos using the bag-of-words (BOW) model. Although each pair of videos captures the same action from two views, the feature representations of the action in the two views differ, because each representation is built on its own dictionary independently. In order to transfer knowledge from one view (sensor node) to another, we simultaneously learn the dictionaries from different views such that the same actions from different views have similar sparse representations. The main idea is illustrated in Figure 1.

Figure 1

Learning discriminative transferable sparse coding in WSN. We force the same actions from different sensor nodes to have similar sparse representations and expect the product value between the sparse coefficient and subdictionary from different class to be zero.

The rest of this paper is organized as follows. We review the sparse coding method in Section 2. Then, we present our DTSC algorithm in Section 3. Section 4 presents experimental results on the IXMAS multiview dataset, where DTSC outperforms state-of-the-art methods. Finally, in Section 5, we conclude the paper.

2. Sparse Coding

Sparse coding (or sparse representation) is a powerful tool for statistical signal processing, and it has already been widely applied in many fields [30, 31]. Its success is largely due to the fact that natural signals are intrinsically sparse in some domain. The model assumes that a signal can be described as a linear combination of a few atoms from an overcomplete dictionary; the signal is sparsely encoded over this dictionary and then classified based on its coding vector.

For a signal $x \in R^{K \times 1}$, we seek a sparse approximation over a dictionary $D = [d_{1}, d_{2}, \dots, d_{M}] \in R^{K \times M}$, where $K \ll M$. The sparsest coding of x over D is the solution of

$$\min_{a} \|a\|_{0} \quad \text{s.t.} \quad \|x - D a\|^{2} \leq \varepsilon, \tag{1}$$

where $\|\cdot\|_{0}$ denotes the $l_{0}$-norm, which counts the number of nonzero elements in a vector, and $\varepsilon$ is a predefined parameter with a small value. Given the dictionary D, the model seeks the sparsest representation of the signal x. However, solving (1) is NP-hard. Some recent work shows that this problem can be tackled by replacing the $l_{0}$-norm with $l_{1}$-norm regularization [32]. In many applications, the dictionary D is unknown and must be constructed from training data $X = [x_{1}, x_{2}, \dots, x_{N}]$. The dictionary D, as well as the sparse coding coefficients $a_{i}$, can be learned by optimizing the following objective function:

$$\begin{aligned} \min_{D, A} \ & \sum_{i = 1}^{N} \left( \|x_{i} - D a_{i}\|^{2} + \lambda \|a_{i}\|_{1} \right) \\ \text{s.t.} \ & \|d_{k}\|^{2} \leq 1, \ \forall k = 1, 2, \dots, M, \end{aligned} \tag{2}$$

where $\|x_{i} - D a_{i}\|^{2}$ denotes the reconstruction error, $A = [a_{1}, a_{2}, \dots, a_{N}]$ $(a_{i} \in R^{M \times 1})$, and λ is the regularization parameter controlling the sparsity of the coefficient vector. $\|\cdot\|_{1}$ denotes the $l_{1}$-norm, which is the sum of the absolute values of the elements of $a_{i}$, and the $l_{2}$-norm constraint on $d_{k}$ prevents trivial solutions in which the atoms grow arbitrarily large while the coefficients shrink toward zero.
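As an illustration of the $l_{1}$-regularized coding step in (2) with a fixed dictionary, the following minimal sketch uses the iterative shrinkage-thresholding algorithm (ISTA). ISTA is chosen here only because it is short and easy to read; it is an assumption of this example and not the solver used later in the paper. The function names, problem sizes, and random data are likewise hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(x, D, lam, n_iter=500):
    """Approximately minimize ||x - D a||^2 + lam * ||a||_1 over a (D is fixed)."""
    # The gradient of ||x - D a||^2 is Lipschitz with constant 2 * sigma_max(D)^2.
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - x)
        a = soft_threshold(a - grad / L, lam / L)
    return a

def objective(x, D, a, lam):
    """Value of the per-signal term in (2)."""
    return float(np.sum((x - D @ a) ** 2) + lam * np.sum(np.abs(a)))

# Toy problem (sizes are illustrative only): an overcomplete dictionary, K << M.
rng = np.random.default_rng(0)
K, M = 8, 32
D = rng.standard_normal((K, M))
D /= np.linalg.norm(D, axis=0)          # atoms satisfy ||d_k||^2 <= 1
a_true = np.zeros(M)
a_true[[3, 17]] = [1.5, -2.0]           # a signal built from two atoms
x = D @ a_true

a_hat = sparse_code_ista(x, D, lam=0.1)
print("nonzero coefficients:", int(np.count_nonzero(a_hat)))
print("objective:", objective(x, D, a_hat, 0.1))
```

With the step size 1/L, each ISTA iteration does not increase the objective, so the returned code is at least as good as the all-zero code it starts from.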

3. Discriminative Transferable Sparse Coding

Standard sparse coding, however, is ill-suited to cross-view action recognition in WSN. First, sparse coding is an unsupervised learning algorithm which neglects the discriminative information among action categories. Second, sparse coding is not robust to view variance because the feature representations of an action in the two sensor nodes are significantly different. To overcome these drawbacks, we propose a novel coding strategy named discriminative transferable sparse coding (DTSC). We expect the sparse coding coefficients not only to possess discriminative power, but also to be similar for the same action in different views.

3.1. Sparse Coding Based on Discriminative Learning

Let $X_{i} = [x_{i 1}, x_{i 2}, \dots, x_{i N}]$ denote the feature vectors of actions from the ith class, where $x_{i j} \in R^{K \times 1}$ is the jth feature vector of the ith class. We learn a subdictionary $D_{i} \in R^{K \times M}$ for the ith class, and then we get a structured dictionary $D = [D_{1}, D_{2}, \dots, D_{C}]$, where C is the number of classes. $A_{i} = [a_{i 1}, a_{i 2}, \dots, a_{i N}]$ are the sparse coefficients of the ith class learned based on $D_{i}$, where $a_{i j} \in R^{M \times 1}$ is the jth sparse coefficient of the ith class. The discriminative power of the sparse coefficients depends on two requirements. First, the sparse coefficients should represent the feature vectors well using the corresponding subdictionary; that is, $x_{i j} \approx D_{i} a_{i j}$. Second, for the feature vectors $x_{j i}$ from any other class j $(j \neq i)$, the coefficients $a_{j i}$ should be nearly zero with respect to $D_{i}$, such that $\sum_{j = 1, j \neq i}^{C} \|D_{i} a_{j i}\|^{2}$ is small, which means that there is no significant correlation between $D_{i}$ and $x_{j i}$. For the sake of computational convenience, we instead employ $\sum_{c = 1, c \neq i}^{C} \|D_{c} a_{i j}\|^{2}$ to compute the correlation, which penalizes the response of the ith class coefficients on the subdictionaries of the other classes. Hence, the sparse coding based on discriminative learning is formulated as

$$\begin{aligned} \min_{D_{i}, A_{i}} \ & \sum_{j = 1}^{N} \left( \|x_{i j} - D_{i} a_{i j}\|^{2} + \lambda \|a_{i j}\|_{1} + \rho \sum_{c = 1, c \neq i}^{C} \|D_{c} a_{i j}\|^{2} \right) \\ \text{s.t.} \ & \|d_{k}\|^{2} \leq 1, \ \forall k = 1, 2, \dots, M, \end{aligned} \tag{3}$$

where ρ is a balancing parameter controlling the importance of discrimination.

3.2. Sparse Coding Based on Transfer Learning

Another goal of our DTSC is to transfer knowledge from one sensor node (source view) to another (target view) using corresponding pairs. We force each pair of videos of the same action observed from the source and target views to have the same sparse coefficient. To this end, we construct a separate subdictionary for each class in each view. Thus, the sparse coding based on transfer learning is formulated as

$$\begin{aligned} \min_{D_{i}^{S}, D_{i}^{T}, A_{i}} \ & \sum_{j = 1}^{N} \Big( \|x_{i j}^{S} - D_{i}^{S} a_{i j}\|^{2} + \|x_{i j}^{T} - D_{i}^{T} a_{i j}\|^{2} + \lambda \|a_{i j}\|_{1} + \rho \sum_{c = 1, c \neq i}^{C} \big( \|D_{c}^{S} a_{i j}\|^{2} + \|D_{c}^{T} a_{i j}\|^{2} \big) \Big) \\ \text{s.t.} \ & \|d_{k}\|^{2} \leq 1, \ \forall k = 1, 2, \dots, M, \end{aligned} \tag{4}$$

where the superscripts S and T denote the source and target views, respectively, and $x_{i j}^{S}$ and $x_{i j}^{T}$ are the feature vectors of a corresponding pair. Equation (4) simultaneously learns the subdictionaries of the ith class for the source and target views, such that $x_{i j}^{S}$ and $x_{i j}^{T}$ have the same coefficient $a_{i j}$. Since we have the same number of labeled actions in the source and target views, the two views can be stacked, and the objective function of our DTSC is given by

$$\begin{aligned} \min_{\tilde{D}_{i}, A_{i}} \ & \sum_{j = 1}^{N} \left( \|\tilde{x}_{i j} - \tilde{D}_{i} a_{i j}\|^{2} + \lambda \|a_{i j}\|_{1} + \rho \sum_{c = 1, c \neq i}^{C} \|\tilde{D}_{c} a_{i j}\|^{2} \right) \\ \text{s.t.} \ & \|\tilde{d}_{k}\|^{2} \leq 1, \ \forall k = 1, 2, \dots, M, \end{aligned} \tag{5}$$

where $\tilde{x}_{i j} = [x_{i j}^{S}; x_{i j}^{T}]$, $\tilde{D}_{i} = [D_{i}^{S}; D_{i}^{T}]$, and $\tilde{D}_{c} = [D_{c}^{S}; D_{c}^{T}]$. From (5), we can see that our DTSC not only possesses discriminative power, but is also robust to view variance.
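The step from (4) to (5) relies on the fact that stacking the two views turns the sum of the two reconstruction errors into a single squared norm. The short check below illustrates this numerically; the sizes and random values are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 6, 4                                   # illustrative sizes, not the paper's
x_s, x_t = rng.standard_normal(K), rng.standard_normal(K)
D_s, D_t = rng.standard_normal((K, M)), rng.standard_normal((K, M))
a = rng.standard_normal(M)                    # one coefficient shared by both views

x_tilde = np.concatenate([x_s, x_t])          # [x^S; x^T], length 2K
D_tilde = np.vstack([D_s, D_t])               # [D^S; D^T], shape (2K, M)

# Reconstruction terms of (4) and of (5) for a single corresponding pair.
two_view = np.sum((x_s - D_s @ a) ** 2) + np.sum((x_t - D_t @ a) ** 2)
stacked = np.sum((x_tilde - D_tilde @ a) ** 2)

# The discrimination terms stack in the same way.
two_view_disc = np.sum((D_s @ a) ** 2) + np.sum((D_t @ a) ** 2)
stacked_disc = np.sum((D_tilde @ a) ** 2)

print(np.isclose(two_view, stacked), np.isclose(two_view_disc, stacked_disc))
```

The same identity holds for every pair and every class, which is why a single coefficient vector can be learned for both views at once.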

3.3. Solution of DTSC

The DTSC model can be optimized by alternately optimizing ${\tilde{D}}_{i}$ and $A_{i}$ for each class. When ${\tilde{D}}_{i}$ is fixed, problem (5) can be solved by optimizing over each coefficient $a_{i j}$ individually:

$$\min_{a_{i j}} \ \|\tilde{x}_{i j} - \tilde{D}_{i} a_{i j}\|^{2} + \lambda \|a_{i j}\|_{1} + \rho \sum_{c = 1, c \neq i}^{C} \|\tilde{D}_{c} a_{i j}\|^{2}. \tag{6}$$

Because the ρ term is quadratic in $a_{i j}$, this remains a linear regression problem with $l_{1}$-norm regularization on the coefficients, which can be solved very efficiently by the feature-sign search algorithm [33]. After optimizing each $a_{i j}$, $j = 1, \dots, N$, the coefficient matrix $A_{i}$ is updated. Once the coefficient matrix $A_{i}$ is updated, we update the dictionary ${\tilde{D}}_{i}$ by solving a least squares problem with quadratic constraints:

$$\begin{aligned} \min_{\tilde{D}_{i}} \ & \sum_{j = 1}^{N} \|\tilde{x}_{i j} - \tilde{D}_{i} a_{i j}\|^{2} \\ \text{s.t.} \ & \|\tilde{d}_{k}\|^{2} \leq 1, \ \forall k = 1, 2, \dots, M. \end{aligned} \tag{7}$$

This can be efficiently solved by using the Lagrange dual method [33]. It should be noted that, when optimizing the ith class subdictionary ${\tilde{D}}_{i}$, the subdictionaries of the other classes are kept fixed. In summary, the whole optimization process is described in Algorithm 1.

Algorithm 1: Solution of DTSC.

Input: ${\tilde{x}}_{i j} = [x_{i j}^{S}; x_{i j}^{T}]$, $j = 1, 2, \dots, N$, where N is the number of corresponding pairs from the ith class;

parameters λ, ρ

Output: ${\tilde{D}}_{i}, A_{i}$

Initialize: obtain the subdictionary ${\tilde{D}}_{i}$ by the k-means clustering algorithm;

while ${\tilde{D}}_{i} \neq {\tilde{D}}_{i}^{new}$ (not converged) do

(1) Fix ${\tilde{D}}_{i}$, and then optimize $A_{i}$:

for $j = 1 : N$ do

solve $a_{i j}$ by (6)

end

(2) Fix $A_{i}$, and optimize ${\tilde{D}}_{i}$ by (7) to obtain ${\tilde{D}}_{i}^{new}$, which replaces ${\tilde{D}}_{i}$ in the next iteration.

end
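The alternating scheme of Algorithm 1 can be sketched as follows for a single class. This is a minimal illustration under several assumptions that differ from the paper: step (1) uses ISTA instead of feature-sign search, step (2) uses projected gradient descent instead of the Lagrange dual method, the dictionary is initialized randomly instead of by k-means, and a fixed number of outer iterations replaces the convergence test. All function names, sizes, and data are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def update_codes(X, D_i, G_other, lam, rho, n_iter=200):
    """Step (1): solve (6) for all columns of X at once with ISTA.
    G_other = sum_{c != i} D_c^T D_c, so a^T G_other a = sum_c ||D_c a||^2."""
    H = D_i.T @ D_i + rho * G_other            # half the Hessian of the smooth part
    L = 2.0 * np.linalg.eigvalsh(H)[-1]        # Lipschitz constant of its gradient
    DtX = D_i.T @ X
    A = np.zeros((D_i.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * (H @ A - DtX)
        A = soft_threshold(A - grad / L, lam / L)
    return A

def update_dictionary(X, A, D_i, n_iter=50):
    """Step (2): projected gradient on (7); atoms stay inside the unit l2 ball."""
    L = 2.0 * np.linalg.eigvalsh(A @ A.T)[-1] + 1e-12
    D = D_i.copy()
    for _ in range(n_iter):
        D = D - 2.0 * (D @ A - X) @ A.T / L
        D = D / np.maximum(np.linalg.norm(D, axis=0), 1.0)   # project each column
    return D

def dtsc_objective(X, D_i, A, G_other, lam, rho):
    """Value of (5) for one class with the other subdictionaries fixed."""
    return float(np.sum((X - D_i @ A) ** 2) + lam * np.abs(A).sum()
                 + rho * np.sum(A * (G_other @ A)))

def learn_subdictionary(X, D_init, others, lam=0.4, rho=5.0, n_outer=10):
    """Alternate steps (1) and (2); `others` holds the fixed D_c for c != i."""
    M = D_init.shape[1]
    G_other = sum((D_c.T @ D_c for D_c in others), np.zeros((M, M)))
    D = D_init.copy()
    for _ in range(n_outer):
        A = update_codes(X, D, G_other, lam, rho)
        D = update_dictionary(X, A, D)
    return D, A

def unit_columns(D):
    return D / np.linalg.norm(D, axis=0)

# Toy data: columns of X play the role of stacked pairs [x^S; x^T].
rng = np.random.default_rng(0)
K2, M, N = 12, 5, 20
X = rng.standard_normal((K2, N))
D0 = unit_columns(rng.standard_normal((K2, M)))
others = [unit_columns(rng.standard_normal((K2, M))) for _ in range(2)]
D, A = learn_subdictionary(X, D0, others)
print("max atom norm:", float(np.linalg.norm(D, axis=0).max()))
```

Both inner solvers are descent methods, so each alternation keeps the objective in (5) from increasing while the other block of variables is held fixed.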

3.4. Feature Representation

In the training stage, we learn a subdictionary ${\tilde{D}}_{i} = [D_{i}^{S}; D_{i}^{T}]$, $i = 1, \dots, C$, for each class in the source and target views by using Algorithm 1. Then, we concatenate the source-view subdictionaries of all C classes to obtain $D^{S} = [D_{1}^{S}, D_{2}^{S}, \dots, D_{C}^{S}]$. Based on the dictionary $D^{S}$, we obtain sparse feature representations of the training action videos in the source view. Finally, we train an SVM classifier using these samples. In the testing stage, we obtain the target-view dictionary $D^{T} = [D_{1}^{T}, D_{2}^{T}, \dots, D_{C}^{T}]$ in the same way. Given a testing action video, its sparse feature is calculated based on $D^{T}$, and the trained SVM classifier is used to label the testing sample.
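To make the construction of the two view-specific dictionaries concrete, the sketch below splits each stacked subdictionary into its source and target halves and concatenates them across classes. It assumes that both views have the same feature dimension K and that the first K rows of each stacked subdictionary belong to the source view; the helper name and the random matrices are hypothetical placeholders for learned subdictionaries.

```python
import numpy as np

def split_views(stacked_subdicts, K):
    """Split each [D_i^S; D_i^T] and concatenate the halves over all classes."""
    D_source = np.hstack([D[:K] for D in stacked_subdicts])   # K x (C * M)
    D_target = np.hstack([D[K:] for D in stacked_subdicts])   # K x (C * M)
    return D_source, D_target

rng = np.random.default_rng(0)
K, M, C = 6, 4, 3                                   # illustrative sizes only
stacked = [rng.standard_normal((2 * K, M)) for _ in range(C)]
D_source, D_target = split_views(stacked, K)
print(D_source.shape, D_target.shape)
```

Because atom k of the source dictionary and atom k of the target dictionary were learned jointly, a sparse code computed on the target dictionary lives in the same coefficient space as the codes used to train the classifier.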

4. Experimental Results

4.1. Dataset and Low-Level Feature Extraction

We verify our DTSC on the IXMAS multiview action dataset [22], which consists of eleven daily-life actions, such as kick, point, and cross arms. Each action is performed three times by twelve actors and observed from five different views including four side views and one top view, where each view corresponds to one sensor node.

For a fair comparison, we extract the same low-level action descriptors as [26, 27]. Specifically, we first extract the local feature, that is, the spatiotemporal interest points proposed in [13]. To detect the interest points, a 2D Gaussian filter and then a 1D Gabor filter are applied to an action video, and the interest points are detected at the local maxima of the response. We extract up to 200 cuboids from each action video. Each cuboid is represented by a 100-dimensional descriptor learned by PCA. These descriptors are further quantized into 1000 codewords by k-means clustering, and each action video is represented by a histogram using the bag-of-words model [34]. To complement the local feature, we then extract the global shape-flow feature [35]. Specifically, three feature channels are extracted from each frame: horizontal optical flow, vertical optical flow, and silhouette. PCA is again used to reduce the dimensionality. Descriptors from neighboring frames are concatenated with the current frame descriptor to incorporate temporal information, and a histogram vector is built over 500 quantized codewords. Finally, for each action video, we concatenate the 1000-dimensional local and 500-dimensional global histograms to form a 1500-dimensional feature vector.

4.2. Pairwise Cross-View Recognition in WSN

Our algorithm is evaluated on all possible pairwise view combinations. For an accurate comparison to [26, 27], we follow the same leave-one-action-class-out strategy for choosing the orphan action, which means that each time only one action class is considered for testing in the target view. The final results are reported as the average accuracy over all action classes in each view. Note that the orphan action class is used neither to train the dictionary nor to establish corresponding pairs. The corresponding pairs are randomly chosen from the training samples, and these pairs account for 30% of the nonorphan samples. We set $λ = 0.4$ and $ρ = 5$ in (5).
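The leave-one-action-class-out protocol described above can be sketched as a simple index split. This is only an illustration of the bookkeeping: the function name is hypothetical, each index is assumed to identify one synchronized source/target video pair, and the 36 samples per class (twelve actors times three repetitions) assume that every performance is kept.

```python
import random

def leave_one_class_out(labels, orphan, pair_fraction=0.3, seed=0):
    """Return (corresponding-pair indices, nonorphan indices, orphan test indices)."""
    non_orphan = [i for i, y in enumerate(labels) if y != orphan]
    test = [i for i, y in enumerate(labels) if y == orphan]
    rng = random.Random(seed)
    n_pairs = round(pair_fraction * len(non_orphan))
    pairs = sorted(rng.sample(non_orphan, n_pairs))   # random 30% of nonorphan samples
    return pairs, non_orphan, test

# 11 action classes with 36 samples each (assumed, see the lead-in).
labels = [c for c in range(11) for _ in range(36)]
pairs, non_orphan, test = leave_one_class_out(labels, orphan=4)
print(len(pairs), len(non_orphan), len(test))
```

Repeating the split with each class in turn as the orphan and averaging the per-class accuracies gives one entry of the kind reported for a source-target view pair.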

Table 1 lists the recognition accuracy for all possible source-target view combinations. We compare our DTSC with a baseline without transfer learning and with the methods of [26, 27]. Note that we omit the accuracy of [28, 36], since they report lower results than [26, 27] in most view combinations. Several conclusions can be drawn from Table 1. First, our DTSC achieves the highest average recognition accuracy for each of the five target views, as can be seen in the last row of Table 1. Second, the method without transfer learning performs very poorly, with a recognition accuracy below 50% for most combinations, while our DTSC achieves much higher accuracy, which demonstrates the effectiveness of transfer learning. Third, our algorithm obtains the highest recognition accuracy in all 20 view combinations. This is because our DTSC not only is able to transfer knowledge between views (sensor nodes), but also possesses discriminative power.

Table 1

Cross-view recognition accuracy on the IXMAS dataset. Each row is a source view and each column a target view. The four accuracy numbers in a tuple are the average recognition accuracies of the method without transfer learning, [26], [27], and DTSC, respectively.

%	$C 0$	$C 1$	$C 2$	$C 3$	$C 4$
$C 0$		(26.4, 79.9, 81.8, 87.4)	(24.6, 76.8, 88.1, 93.6)	(20.3, 76.8, 87.5, 92.3)	(27.9, 74.8, 81.4, 87.1)
$C 1$	(31.2, 81.2, 87.5, 91.6)		(23.0, 75.8, 82.0, 88.4)	(23.0, 78.0, 92.3, 93.1)	(20.3, 70.4, 74.2, 85.2)
$C 2$	(23.3, 79.6, 85.3, 91.4)	(20.9, 76.6, 82.6, 85.7)		(13.0, 79.8, 82.6, 86.3)	(17.9, 72.8, 76.5, 83.4)
$C 3$	(9.7, 73.0, 82.1, 87.5)	(24.9, 74.1, 81.5, 86.6)	(23.0, 74.4, 80.2, 86.8)		(16.7, 71.2, 70.0, 78.1)
$C 4$	(51.2, 82.0, 78.8, 87.2)	(38.2, 68.3, 73.8, 77.3)	(41.2, 74.0, 77.7, 83.7)	(53.3, 71.1, 78.7, 84.5)
Ave.	(28.9, 79.0, 83.4, 89.4)	(27.6, 74.7, 79.9, 84.3)	(28.0, 75.2, 82.0, 88.1)	(27.4, 76.4, 85.3, 89.0)	(20.7, 71.2, 75.5, 83.5)
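As a quick arithmetic check on the last row of Table 1, the snippet below recomputes the per-target averages of the DTSC accuracies (the last entry of each tuple) from the 20 view combinations. The dictionary layout and variable names are choices made for this example only; the numbers are copied from the table.

```python
# DTSC accuracies from Table 1, keyed by (source view, target view).
dtsc = {
    ("C0", "C1"): 87.4, ("C0", "C2"): 93.6, ("C0", "C3"): 92.3, ("C0", "C4"): 87.1,
    ("C1", "C0"): 91.6, ("C1", "C2"): 88.4, ("C1", "C3"): 93.1, ("C1", "C4"): 85.2,
    ("C2", "C0"): 91.4, ("C2", "C1"): 85.7, ("C2", "C3"): 86.3, ("C2", "C4"): 83.4,
    ("C3", "C0"): 87.5, ("C3", "C1"): 86.6, ("C3", "C2"): 86.8, ("C3", "C4"): 78.1,
    ("C4", "C0"): 87.2, ("C4", "C1"): 77.3, ("C4", "C2"): 83.7, ("C4", "C3"): 84.5,
}
views = ["C0", "C1", "C2", "C3", "C4"]

def target_average(target):
    """Mean DTSC accuracy over the four source views for one target view."""
    values = [acc for (_, tgt), acc in dtsc.items() if tgt == target]
    return sum(values) / len(values)

averages = {v: target_average(v) for v in views}
reported = {"C0": 89.4, "C1": 84.3, "C2": 88.1, "C3": 89.0, "C4": 83.5}
for v in views:
    print(v, f"{averages[v]:.3f}", reported[v])
```

The recomputed means agree with the reported averages to within rounding of the first decimal place.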

4.3. Influence of Parameter Variation

We further evaluate the performance of the proposed DTSC with respect to λ and ρ in (5), which control the sparsity of the coefficient vector and the importance of discrimination, respectively. Table 2 indicates that the best accuracy (86.9%) is obtained with $λ = 0.4$ and $ρ = 5$. These values are adopted in all the experiments.

Table 2

Cross-view action recognition accuracy (%) under different λ and ρ.

ρ \ λ	0.1	0.4	1	5
0.5	82.1	83.6	84.0	82.3
1	83.4	84.5	84.2	82.4
5	85.4	86.9	85.5	83.2
10	84.8	85.6	83.8	82.3

5. Conclusions

In this paper, we propose a discriminative transferable sparse coding approach (DTSC) for cross-view action recognition in WSN. The proposed DTSC simultaneously considers the discrimination and transferability of the sparse representation. To obtain a discriminative sparse representation, we expect the product of the sparse coefficient and the subdictionaries from different classes to be close to zero. To learn a transferable sparse representation, we force the same actions from different sensor nodes to have similar sparse coefficients. The experimental results demonstrate that our method achieves better results than previous methods in cross-view action recognition in WSN.

Footnotes

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant no. 61401309 and no. 61202327 and the Doctoral Fund of Tianjin Normal University under Grant no. 5RL134 and no. 52XB1405.

References

1. Yick, Mukherjee, and Ghosal, "Wireless sensor network survey," Computer Networks, vol. 52, no. 12, pp. 2292–2330, 2008. doi:10.1016/j.comnet.2008.04.002

2. Liang, "Design and analysis of distributed radar sensor networks," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 11, pp. 1926–1933, 2011. doi:10.1109/TPDS.2011.45

3. Liang, "Situation understanding based on heterogeneous sensor networks and human-inspired favor weak fuzzy logic system," IEEE Systems Journal, vol. 5, no. 2, pp. 156–163, 2011. doi:10.1109/JSYST.2010.2090404

4. Demirbas, K. Y. Chow, and C. S. Wan, "INSIGHT: internet-sensor integration for habitat monitoring," in Proceedings of the International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM '06), pp. 553–558, June 2006. doi:10.1109/wowmom.2006.52

5. Burrell, Brooke, and Beckwith, "Vineyard computing: sensor networks in agricultural production," IEEE Pervasive Computing, vol. 3, no. 1, pp. 38–45, 2004. doi:10.1109/mprv.2004.1269130

6. V. A. Petrushin, Wei, Shakil, Roqueiro, and A. V. Gershman, "Multiple-sensor indoor surveillance system," in Proceedings of the 3rd Canadian Conference on Computer and Robot Vision (CRV '06), pp. 40–47, June 2006. doi:10.1109/crv.2006.50

7. Snidaro and G. L. Foresti, "A multi-camera approach to sensor evaluation in video surveillance," in Proceedings of the IEEE International Conference on Image Processing (ICIP '05), pp. 1101–1104, September 2005. doi:10.1109/icip.2005.1529947

8. Yilmaz and Shah, "Actions sketch: a novel action representation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 984–989, June 2005. doi:10.1109/cvpr.2005.58

9. Lin, Jiang, and L. S. Davis, "Recognizing actions by shape-motion prototype trees," in Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV '09), pp. 444–451, September 2009. doi:10.1109/iccv.2009.5459184

10. Nevatia, "Single view human action recognition using key pose matching and viterbi path searching," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), pp. 1–8, IEEE, Minneapolis, Minn, USA, June 2007. doi:10.1109/cvpr.2007.383131

11. Raptis and Soatto, "Tracklet descriptors for action modeling and video analysis," in Computer Vision – ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5–11, 2010, Proceedings, Part I, vol. 6311 of Lecture Notes in Computer Science, pp. 577–590, Springer, Berlin, Germany, 2010. doi:10.1007/978-3-642-15549-9_42

12. A. A. Efros, A. C. Berg, Mori, and Malik, "Recognizing action at a distance," in IEEE International Conference on Computer Vision, pp. 726–733, October 2003.

13. Dollár, Rabaud, Cottrell, and Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS '05), pp. 65–72, October 2005. doi:10.1109/vspets.2005.1570899

14. Liu, Yang, and Shah, "Learning semantic visual vocabularies using diffusion distance," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops '09), pp. 461–468, Miami, Fla, USA, June 2009. doi:10.1109/cvprw.2009.5206845

15. Zhang, Wang, Xiao, Zhou, and Liu, "Attribute regularization based human action recognition," IEEE Transactions on Information Forensics and Security, vol. 8, no. 10, pp. 1600–1609, 2013. doi:10.1109/TIFS.2013.2258152

16. Kovashka and Grauman, "Learning a hierarchy of discriminative space-time neighborhood features for human action recognition," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 2046–2053, June 2010. doi:10.1109/cvpr.2010.5539881

17. Zhang, Wang, Xiao, Zhou, and Liu, "Action recognition using context-constrained linear coding," IEEE Signal Processing Letters, vol. 19, no. 7, pp. 439–442, 2012. doi:10.1109/lsp.2012.2191615

18. Liang, Cheng, and S. W. Samn, "NEW: network-enabled electronic warfare for target recognition," IEEE Transactions on Aerospace and Electronic Systems, vol. 46, no. 2, pp. 558–568, 2010. doi:10.1109/taes.2010.5461641

19. Parameswaran and Chellappa, "View invariance for human action recognition," International Journal of Computer Vision, vol. 66, no. 1, pp. 83–101, 2006. doi:10.1007/s11263-005-3671-4

20. Rao, Yilmaz, and Shah, "View-invariant representation and recognition of actions," International Journal of Computer Vision, vol. 50, no. 2, pp. 203–226, 2002. doi:10.1023/a:1020350100748

21. T.-P. Tian and Sclaroff, "Simultaneous learning of nonlinear manifold and dynamical models for high-dimensional time series," in Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV '07), pp. 1–8, Rio de Janeiro, Brazil, October 2007. doi:10.1109/iccv.2007.4409044

22. Weinland, Boyer, and Ronfard, "Action recognition from arbitrary views using 3D exemplars," in Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV '07), pp. 1–7, October 2007. doi:10.1109/iccv.2007.4408849

23. I. N. Junejo, Dexter, Laptev, and Pérez, "Cross-view action recognition from temporal self-similarities," in Proceedings of the European Conference on Computer Vision, pp. 293–306, October 2008.

24. I. N. Junejo, Dexter, Laptev, and Pérez, "View-independent action recognition from temporal self-similarities," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 172–185, 2011. doi:10.1109/tpami.2010.68

25. C.-Y. Chen and Grauman, "Inferring unseen views of people," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14), pp. 2011–2018, Columbus, Ohio, USA, June 2014. doi:10.1109/cvpr.2014.258

26. Liu, Shah, Kuipers, and Savarese, "Cross-view action recognition via view knowledge transfer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '11), pp. 3209–3216, June 2011. doi:10.1109/cvpr.2011.5995729

27. Li and Zickler, "Discriminative virtual views for cross-view action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR '12), pp. 2855–2862, Providence, RI, USA, June 2012. doi:10.1109/CVPR.2012.6248011

28. Farhadi and M. K. Tabrizi, "Learning to recognize activities from the wrong view point," in Proceedings of the European Conference on Computer Vision, pp. 154–166, 2008.

29. Zhang, Wang, Xiao, Zhou, Liu, and Shi, "Cross-view action recognition via a continuous virtual path," in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '13), pp. 2690–2697, June 2013. doi:10.1109/cvpr.2013.347

30. Wright, A. Y. Yang, Ganesh, and S. S. Sastry, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009. doi:10.1109/tpami.2008.79

31. Yang, Gong, and Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 1794–1801, June 2009. doi:10.1109/cvprw.2009.5206757

32. D. L. Donoho, "For most large underdetermined systems of linear equations the minimal $l_{1}$-norm solution is also the sparsest solution," Communications on Pure and Applied Mathematics, vol. 59, no. 6, pp. 797–829, 2006. doi:10.1002/cpa.20132

33. Lee, Battle, Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Advances in Neural Information Processing Systems, pp. 801–808, 2006.

34. Wang, M. M. Ullah, Kläser, Laptev, and Schmid, "Evaluation of local spatio-temporal features for action recognition," in Proceedings of the 20th British Machine Vision Conference (BMVC '09), September 2009. doi:10.5244/c.23.124

35. Tran and Sorokin, "Human activity recognition with metric learning," in Proceedings of the European Conference on Computer Vision, pp. 548–561, October 2008.

36. Farhadi, M. K. Tabrizi, Endres, and Forsyth, "A latent model of discriminative aspect," in Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV '09), pp. 948–955, October 2009. doi:10.1109/iccv.2009.5459350