Abstract
Recognizing human actions in wireless sensor networks (WSN) has attracted great interest owing to the requirements of real-world applications. Recently, the bag-of-features (BOF) model has proved effective for human action recognition. In this paper, we propose a novel method named local random sparse coding (LRSC) for human action recognition in WSN, built on the BOF model. The contribution is twofold. First, we apply the random projection (RP) technique to each feature vector to alleviate the curse of dimensionality. Second, we exploit the locality of the codebook and propose to reconstruct each feature using similar codewords. Our method is evaluated on the KTH and UCF Sports databases, and the experimental results demonstrate that it outperforms previous methods for human action recognition in WSN.
1. Introduction
The proliferation of wireless communications and electronics has created opportunities for the development of wireless sensor networks (WSN) [1, 2]. A WSN consists of a variety of sensors, such as video cameras, microphones, infrared badges, and RFID tags, which drives WSN applications in surveillance systems, guiding systems, biological detection, habitat monitoring, agriculture, and health monitoring [3, 4]. Surveillance for abnormal event detection and monitoring elderly and sick people at home are examples of applications that require the ability to automatically recognize human actions in a WSN whose sensor nodes are surveillance cameras. Recognizing human actions is also an essential problem in computer vision and pattern recognition; it mainly concerns how to build a discriminative and compact representation of human actions in video.
Recently, approaches that combine local spatiotemporal descriptors with the bag-of-features (BOF) model [5–7] to represent actions have shown promising results. Since the BOF representation does not rely on preprocessing techniques, for example, trajectory tracking or motion detection, it is relatively robust to illumination variation, background change, and noise. Nevertheless, the BOF representation has three limitations. The first is the quantization error incurred when generating a codebook. To address this drawback, Wang et al. [8] proposed locality-constrained linear coding (LLC), which uses several nearest codewords to linearly encode a descriptor in the Euclidean domain, and Yang et al. [9] proposed the ScSPM method, in which sparse coding replaces hard vector quantization to obtain a nonlinear coding. The second limitation is the high dimensionality of local descriptors, which leads to the curse of dimensionality. Many approaches have been proposed to project high dimensional features into a lower dimensional subspace, including principal component analysis (PCA) and linear discriminant analysis (LDA). The last limitation is that the BOF representation neglects local information, because it counts a histogram over the whole action video.
In this paper, we propose a novel method named local random sparse coding (LRSC) for human action recognition in WSN that overcomes these drawbacks of the BOF representation. Since the proposed LRSC inherits the properties of sparse coding, it alleviates the quantization error. For all the features extracted from training action videos, we utilize the random projection (RP) technique, which projects a set of descriptors from a high dimensional space onto a randomly chosen low dimensional subspace. The theory of compressed sensing has demonstrated the information-preserving and dimensionality reduction power of the RP technique [10, 11]. Moreover, RP is faster than other dimensionality reduction techniques, because the low dimensional features are obtained directly by multiplication with a random projection matrix. When reconstructing a feature, the selected codewords may be very heterogeneous, possibly leading to the loss of a large amount of information. We therefore propose to consider the locality of the codebook; concretely, we use similar feature vectors in the input space to construct the codebooks. The experimental results demonstrate the effectiveness of our LRSC.
The rest of this paper is organized as follows. We present our LRSC algorithm in Section 2. Section 3 shows the experimental results which outperform the state-of-the-art methods on the KTH and UCF Sports databases. Finally, in Section 4, we conclude the paper.
2. Approach
2.1. Method Overview
The proposed LRSC consists of six stages, as shown in Figure 1: (a) detection of spatiotemporal interest points in action videos, (b) representation of each spatiotemporal interest point as a feature vector, (c) reduction of the dimensionality of the feature vectors using the RP technique, (d) locality-based grouping of the feature vectors, (e) sparse coding, and (f) concatenation of all the sparse coefficient vectors into the final representation.

The flowchart of the proposed LRSC strategy.
2.2. Detection of Spatiotemporal Interest Points
To detect interest points in action videos, we employ the Harris 3D corner detector proposed in [12], as shown in Figure 1(a), which is an extension of the Harris 2D corner detector. The Harris 3D detector finds locations where the video intensity has significant local variation in both space and time. To this end, matrix F is defined as
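The definition is elided in this version. For reference, the Harris 3D detector of [12] is typically formulated in terms of a spatiotemporal second-moment matrix; the notation below follows the standard formulation and is an assumption about the elided equation:

$$
\mu = g(\cdot;\sigma_i^2,\tau_i^2) \ast
\begin{pmatrix}
L_x^2 & L_x L_y & L_x L_t \\
L_x L_y & L_y^2 & L_y L_t \\
L_x L_t & L_y L_t & L_t^2
\end{pmatrix},
$$

where $L_x$, $L_y$, and $L_t$ are spatiotemporal Gaussian derivatives of the video and $g(\cdot;\sigma_i^2,\tau_i^2)$ is a Gaussian smoothing kernel. Interest points are then local maxima of the response $H = \det(\mu) - k\,\operatorname{trace}^3(\mu)$.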
2.3. Representation of Spatiotemporal Interest Points
After detecting spatiotemporal interest points, we represent each interest point as a feature vector (see Figure 1(b)). For each interest point, the histogram of oriented gradients (HOG) and the histogram of optical flow (HOF) are used as local appearance and motion descriptors. Therefore, each interest point is represented as a 162-dimensional feature vector. Let
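As a minimal sketch, the 162 dimensions arise from concatenating a 72-dimensional HOG and a 90-dimensional HOF descriptor; the 72/90 split is the common configuration of the descriptors from [12] and is an assumption here, and the values below are random placeholders rather than real gradient or flow histograms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder descriptors for one interest point (random stand-ins,
# not real gradient/flow histograms).
hog = rng.random(72)   # histogram of oriented gradients (appearance)
hof = rng.random(90)   # histogram of optical flow (motion)

# Each interest point is represented by the concatenated descriptor.
feature = np.concatenate([hog, hof])
assert feature.shape == (162,)
```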
2.4. Random Projection
Dimensionality reduction is a key issue in processing high dimensional data because it alleviates the curse of dimensionality and other undesirable properties of high dimensional spaces. Although there are many methods for dimensionality reduction, such as PCA and its variants, they are computationally expensive. Random projection is a dimensionality reduction technique that projects a set of points from a high dimensional space onto a randomly chosen low dimensional subspace. The Johnson-Lindenstrauss lemma (JL lemma) [14] provides the theoretical basis for random projection: it states that a set of points in a high dimensional Euclidean space can be mapped into a space of logarithmic dimension such that the distances between the points are approximately preserved. Based on this result, random projection has been used in a wide variety of applications, such as texture classification [15] and face recognition [16]. Hence, we employ RP in this paper, as shown in Figure 1(c).
We choose a random projection matrix Φ to project the high dimensional vectors into a low dimensional subspace.
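A sketch of the projection step with a Gaussian random matrix; the dimensions (162 → 35) match the experimental settings in Section 3, while the i.i.d. Gaussian construction and 1/√d scaling follow common JL-style constructions and are assumptions, as the paper does not specify how Φ is drawn:

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 162, 35          # original and reduced dimensionality

# Random projection matrix with i.i.d. Gaussian entries; the 1/sqrt(d)
# scaling keeps expected squared norms approximately unchanged.
Phi = rng.standard_normal((d, D)) / np.sqrt(d)

X = rng.random((1000, D))        # 1000 descriptors (placeholder data)
X_low = X @ Phi.T                # project every descriptor at once
assert X_low.shape == (1000, d)
```

Because the projection is a single matrix multiplication, it is much cheaper than fitting PCA on the full descriptor set.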
2.5. Locality
When reconstructing the feature vectors, the selected codewords may be very heterogeneous, which may result in the loss of a large amount of information. We prefer to reconstruct each feature vector using codewords that are similar to it, and thus propose to consider the locality of the codebook. Concretely, we utilize the k-means algorithm to cluster all the feature vectors
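A minimal pure-NumPy sketch of this clustering step with 5 centers, matching the setting in Section 3; the use of Lloyd's algorithm with random initialization is an assumed implementation detail:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=2):
    """Plain Lloyd's algorithm: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each feature vector to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # Recompute centers; keep the old center if a cluster empties.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers, labels

rng = np.random.default_rng(3)
X = rng.random((500, 35))            # projected descriptors (placeholder)
centers, labels = kmeans(X, k=5)
assert centers.shape == (5, 35) and labels.shape == (500,)
```

Each cluster then contributes its own codebook, so every feature vector is encoded with codewords drawn from its neighborhood of the input space.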
2.6. Sparse Coding
Based on the fact that natural signals are intrinsically sparse in some domain, sparse coding has achieved great success in many fields, such as face recognition [17], image classification [9], and natural texture classification [18]. Sparse coding reconstructs an input feature vector as a linear combination of a few codewords from an overcomplete dictionary. For each cluster, we encode each feature vector using the sparse coding algorithm:
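Sparse coding solves, per descriptor, the ℓ1-regularized least-squares problem min_α ½‖y − Dα‖² + λ‖α‖₁. A minimal sketch using ISTA (iterative soft-thresholding); the choice of solver, the regularization weight, and the iteration count are assumptions, since the paper does not specify them:

```python
import numpy as np

def sparse_code(y, D, lam=0.1, iters=300):
    """ISTA for min_a 0.5*||y - D a||^2 + lam*||a||_1."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ a - y)
        v = a - grad / L
        a = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)  # soft-threshold
    return a

rng = np.random.default_rng(4)
D = rng.standard_normal((35, 200))
D /= np.linalg.norm(D, axis=0)           # unit-norm codewords
a_true = np.zeros(200); a_true[[3, 50, 120]] = [1.0, -0.7, 0.5]
y = D @ a_true                           # signal that is 3-sparse over D
a = sparse_code(y, D)
assert np.sum(np.abs(a) > 1e-3) < 100    # recovered code is sparse
```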
2.7. Final Representation
The final representation of each interest point is
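Per stage (f) of Figure 1, a video-level descriptor is then built from the per-point sparse codes. A sketch of one plausible reading, assuming max pooling of the codes within each cluster followed by concatenation across clusters; the pooling operator is an assumption, as the paper specifies only concatenation:

```python
import numpy as np

n_clusters, n_codewords = 5, 2000        # settings from Section 3

rng = np.random.default_rng(5)
# Placeholder sparse codes for 300 interest points of one video
# (each row is mostly zero, mimicking a sparse coefficient vector).
codes = rng.random((300, n_codewords)) * (rng.random((300, n_codewords)) < 0.01)
labels = rng.integers(0, n_clusters, 300)   # cluster of each interest point

# Max-pool codes within each cluster, then concatenate across clusters.
pooled = [codes[labels == j].max(0) if np.any(labels == j)
          else np.zeros(n_codewords) for j in range(n_clusters)]
video_rep = np.concatenate(pooled)
assert video_rep.shape == (n_clusters * n_codewords,)
```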
3. Experimental Results
We evaluate the effectiveness of our LRSC method on two publicly available databases: the KTH dataset [20] and the UCF Sports dataset [21]. We compare our algorithm against relevant baselines and other state-of-the-art human action recognition algorithms. Three algorithms serve as our baselines: sparse coding (SC), sparse coding with random projection (R + SC), and sparse coding with local clustering (L + SC). The local clustering is controlled by the number of cluster centers and the random projection by the projected dimension; we set the number of cluster centers to 5, the dimension of the random projection to 35, and the number of codewords to 2000.
The KTH database is a widely used action database. It has 599 action videos covering 6 action classes, walking, jogging, running, boxing, waving, and clapping, performed by 25 subjects under four different scenarios. We adopt the leave-one-out cross validation strategy: the videos of 24 actors are used for training and those of the remaining actor for testing. The average accuracies on the KTH database are listed in Table 1, and the confusion table of recognition results is shown in Figure 2. From Table 1, we can see that our LRSC method achieves the best accuracy of 96.2% on the KTH database. Furthermore, four observations can be drawn from the experimental results. First, comparing LRSC with L + SC, the former is 2.8% higher in accuracy, which shows that the lower dimensional feature vectors obtained by RP capture the intrinsic structure of the original space. Second, the accuracy of LRSC is 3.4% higher than that of R + SC, because we consider the locality of the feature space. Third, LRSC gains 5.9% in accuracy over SC, because LRSC not only inherits the positive properties of sparse coding but also accounts for the characteristics of the feature space, namely, its dimensionality and locality. Finally, the confusion table shows that leg-related actions (running and jogging) are prone to misclassification; a likely reason is that they exhibit similar context and appearance.
The comparison of our method with the state-of-the-art methods and the baseline methods on the KTH database.

Confusion table of our method on the KTH database. The element in row i and column j is the percentage of the ith action class recognized as the jth class.
The UCF Sports database contains 150 sports videos spanning ten action categories. The database pools actions from a wide range of scenes and viewpoints, so the videos exhibit large intraclass variation. We adopt leave-one-out cross validation, cycling through each sample as the test video one at a time. The performance of the different methods is listed in Table 2, and the confusion table of recognition results on the UCF Sports dataset is shown in Figure 3. The proposed LRSC method outperforms the three baselines and the other state-of-the-art methods, reaching 89.3% accuracy on the UCF Sports database. These results again confirm the effectiveness of LRSC on a realistic and complicated action database.
The comparison of our method with the state-of-the-art methods and the baseline methods on the UCF Sports database.

Confusion table of our method on the UCF Sports database.
4. Conclusions
In this paper, we propose a novel method named local random sparse coding (LRSC) for human action recognition in WSN. The proposed LRSC inherits the properties of sparse coding, so it can alleviate the quantization error. Moreover, LRSC accounts for the characteristics of the feature space, namely, its dimensionality and locality. Concretely, we reduce the dimensionality of the feature vectors using the RP technique, which preserves the intrinsic structure of the original space, and we exploit locality by employing the k-means clustering algorithm to group similar feature vectors. The experimental results demonstrate that our method achieves better results than previous methods for human action recognition in WSN.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by National Natural Science Foundation of China under Grant no. 61401309 and Doctoral Fund of Tianjin Normal University under Grants no. 5RL134 and no. 52XB1405.
