
Abstract

Automatic video annotation has become an important issue in visual sensor networks because of the semantic gap between low-level features and high-level concepts. Although it has been studied extensively, the semantic representation of visual information is still not well understood. To address the pattern classification problem in video annotation, this paper proposes a discriminative constraint that makes the sparse representation coefficients discriminative. We study a general method of discriminative dictionary learning that is independent of the specific dictionary and classifier learning algorithms. Furthermore, a tightly coupled discriminative sparse coding model is introduced. The experimental results show that the proposed method achieves better video annotation performance than the existing schemes it is compared with.

1. Introduction

Visual sensor networks are frequently described as the convergence of sensor networks and distributed smart cameras. With them, the volume of video data produced by both individual and organizational users has grown explosively. For users to find a target video quickly and accurately, information retrieval urgently needs efficient ways to organize, manage, and index these video data, and video semantic annotation is a central issue in video indexing. Video semantic annotation assigns accurate semantic or conceptual “tags” to a video based on its content; it maps the underlying low-level characteristics to the high-level semantic concepts of the video and thereby narrows the “semantic gap.” With these tags, video data managers can efficiently perform operations such as accession and contraction, and individual users gain a way to search and share videos. Moreover, existing network video search engines such as Google, YouTube, and Yahoo! Video mostly rely on text-based retrieval because it is fast and relatively mature, and “tags” are an important part of the text information of a video. However, manual tagging entails an enormous workload and cost, is inefficient, and suffers from high subjectivity. Hence, machine learning methods are needed for automatic video semantic annotation based on the analysis of video content, which can also support collaborative annotation and create shared structured knowledge [1].

Video annotation and retrieval commonly require considerable relevance between the videos and the given concept, also called correlation; they also emphasize the “topicality” and “uniqueness” of the retrieval results [2]. To make retrieval and browsing efficient for users, the results returned by a search engine should be correlated with the “concept” and should also be diverse in their “subconcepts.” With the explosive growth of video data, databases now contain many homologous videos (various versions edited from the same original video); hence, diversified, “nonredundant” video retrieval becomes significant. Ideally, the top-ranked videos in the returned results should cover all “subconcepts” of the “concept” without repeating any “subconcept.” On the one hand, because user queries are often fuzzy, such results convey broader and more diverse video semantics and therefore cater better to users' requirements [3–5]; on the other hand, the “topicality” of these “unique” videos is preserved even when massive data are involved, and the “nonredundant” results improve users' browsing efficiency.

For these reasons, when annotating videos semantically, we attempt to separate videos of different “subconcepts” from each other and annotate them discriminatively, which makes it feasible to diversify the results during the indexing process. Moreover, discriminatively preannotating the “subconcepts,” compared with diversification at query time, speeds up the real-time processing of search engines or systems and consequently shortens the time that users wait online. Taking advantage of sparse coding, we adopt a diversity restraint as the discriminative constraint and add the constraint term to the sparse coding objective function, so that the sparse representation coefficients and the dictionary reinforce each other's discrimination; finally, we map the sparse coding to the kernel space in order to obtain better video retrieval results.

2. Related Works

2.1. Video Annotation

There are two approaches to automatic video annotation: one is based on machine learning [6, 7] and the other is based on searching [8]. The machine learning approach uses tagged training sets to train classification models such as artificial neural network (ANN) [9], kernel density estimation (KDE), Gaussian mixture model (GMM), support vector machine (SVM) [10], hidden Markov model (HMM) [11], graph model [12], optimized multigraph-based semisupervised learning (OMG-SSL) [13], incremental learning model [14], support tensor machine (STM) [15], cross media relevance model (CMRM) [16], multicorrelation learning [17], probabilistic latent semantic analysis (PLSA) [18], and fusing semantic topics (PLSA-Fusion) [19], and then predicts the tags of new video data with the trained model. The search-based approach, on the other hand, searches for videos similar to the target video; it explores the local sample and label distributions to obtain a neighborhood similarity measure [20] and then annotates the target video by label propagation. VATIC (Video Annotation Tool from Irvine, California) is an open platform released for crowdsourced video labeling; it provides a video annotation user interface and has been used to annotate various large and complicated datasets [21]. However, whichever approach is used to annotate video automatically, the first step is to extract the video content from the raw data and then to represent it effectively or extract further features from it. Information extraction (IE) is a topic in semantic processing that covers entities, relations, and events in natural language texts [22].

Feature extraction mainly serves three functions: (1) removing the noise, contained in the low-level features of highly complex images, that interferes with annotation and retrieval; (2) selecting features and reducing the dimension of the high-dimensional low-level visual features, which simplifies subsequent learning and classification; and (3) narrowing the “semantic gap” by extracting semantic features that bridge low-level features and high-level “concepts.” Given that today's diverse classification models are already considerably mature, video content representation plays a particularly important role in video annotation.

Video annotation usually takes the shot as the basic unit, and the video content features are extracted from the key frames. Thus, image features are the major features of the video. Traditional global features such as color and texture generally have difficulty expressing semantic information. Recently, however, techniques based on local features, including scale-invariant feature transform (SIFT), rotation-invariant feature transform (RIFT), bag-of-words (BoW), and bag-of-features (BoF), have shown great potential for conveying semantics. Take BoW as an example: it is a mainstream image representation method that represents image content and categorizes visual objects by a histogram of visual words [23], and it has also been applied to texture representation by combining different features to construct the vocabulary. In [24], texture representation is built on the BoW framework, and colour-texture image content is represented by the attributes of image blobs. BoF, another efficient image representation model, describes an image as the statistics or distribution of local characteristics and is invariant to scale, rotation, and illumination; compared with global features, it expresses video semantics better [25]. Spatial pyramid matching (SPM), which is built on the BoF model, can further express the spatial relationships among the objects in a scene. In short, these strengths give the BoF model better visual representation ability, and it already performs well in video annotation.

2.2. Diverse Video Annotation and Representation

Until now, academic research has focused more on diversified retrieval than on diverse video or image annotation. In diversified retrieval, the training sets generally provide images with “concept” tags rather than “subconcept” tags; hence, diversified learning is an unsupervised process. Most diversified retrieval techniques start from a completed relevance search and rerank the relevant results with an unsupervised learning algorithm in order to increase their diversity. Reranking mainly relies on “Greedy” selection or on clustering. The “Greedy” selection method usually selects and reranks the relevant retrieval results according to the maximal marginal relevance (MMR) index, and indexes such as MMR are widely used in diversified retrieval [26, 27]; its drawback is that a poor early selection may strongly influence the subsequent selections and thus worsen the whole ranking. The clustering-based method takes a top-ranked subset of the relevance retrieval results, clusters it, and then promotes the samples close to the cluster centers to obtain diversified retrieval results.
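To make the greedy reranking idea concrete, the following is a minimal sketch of MMR-style selection in its commonly used form, where each step picks the candidate that maximizes a weighted difference between relevance and redundancy with the items already chosen. The function name `mmr_rerank`, the `trade_off` parameter, and the toy scores are illustrative assumptions and are not taken from [26, 27].

```python
import numpy as np

def mmr_rerank(relevance, similarity, k, trade_off=0.7):
    """Greedy MMR-style selection of k items (illustrative sketch).

    relevance:  (n,) relevance of each candidate to the query.
    similarity: (n, n) pairwise similarity between candidates.
    trade_off:  weight of relevance against redundancy (hypothetical default).
    """
    relevance = np.asarray(relevance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Redundancy is the largest similarity to anything already selected.
            redundancy = max((similarity[i, j] for j in selected), default=0.0)
            score = trade_off * relevance[i] - (1.0 - trade_off) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Items 0 and 1 are near duplicates; item 2 is less relevant but different.
relevance = [0.9, 0.85, 0.5]
similarity = np.array([[1.0, 0.95, 0.1],
                       [0.95, 1.0, 0.1],
                       [0.1, 0.1, 1.0]])
order = mmr_rerank(relevance, similarity, k=3, trade_off=0.5)
```

In this toy case the near duplicate of the first pick is pushed to the end of the ranking, which also illustrates the drawback noted above: every later choice depends on the first one.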

Nevertheless, essentially all reranking techniques operate at the search level, whereas here the diversified learning problem is studied at the feature extraction level. Whether the task is annotation or retrieval, and whether or not a reranking method is adopted, the visual representation of the video acts, as mentioned previously, as the “bridge” and “bond” that maps low-level features to high-level “concepts.” A well-performing visual representation not only removes the distraction that noise in the complicated low-level features causes for diversified learning but also reduces the dimension of the high-dimensional features; hence, it plays a significant role in both classification and retrieval. Unfortunately, research on visual representation or feature extraction for diversified learning remains scarce and does not yet satisfy the need of diversified learning to distinguish “subconcepts.”

2.3. Sparse Coding

In recent years, sparse coding has advanced rapidly in the machine learning field. Its application range already extends to blind source separation [28], voice signal processing [29], image and video feature extraction, signal and image denoising [30], pattern recognition and classification [31, 32], video retrieval, visual tracking, fault diagnosis based on adaptive feature extraction [33], event detection, image compression, image restoration, reconstruction, imaging, and so forth. At the same time, remarkable progress has been made on algorithms for solving the sparse coding coefficients, such as basis pursuit (BP) [34], BPDN homotopy, and feature sign search (FSS), and on “dictionary” learning algorithms such as maximum likelihood estimation (MLE), method of optimal directions (MOD), conjugate gradient method, K-SVD [35], Lagrange dual method [36], and K-LMS.

Sparse coding, which originates from research findings in neuroscience, is widely and successfully employed in machine vision. The primary visual cortex (V1) of the brain expresses the received visual signals as reconstructions from a few interpretable “basis” elements [37], so that a sparse signal represents the image signal [38]. Accordingly, sparse coding is well suited to video representation in video annotation; moreover, compared with low-level features, it offers more advantages for analyzing the “redundancy” of visual patterns at the semantic layer, which has been confirmed by applied research on video thumbnails [39]. Therefore, it is viable to adopt sparse coding for video representation and to extend it to diversified annotation, which aims to eliminate the “redundancy” of the retrieval results while enhancing their “topicality.”

3. Method Description

For a general classification problem, the training sample points can be represented as a matrix $X = {[X_{1}, X_{2}, \dots, X_{n}]}^{T}$ , $X_{i} \in R^{m}$ , $i = 1, \dots, n$ , where n is the number of samples and m is the feature dimension. The class label of the sample $X_{i}$ is denoted by $c_{i} \in \{1, 2, \dots, N_{c}\}$ , where $N_{c}$ is the number of classes. In practice, the feature dimension m is often very high [40]. The goal of the proposed algorithm is to transform the data from the original high-dimensional space to a low-dimensional one; that is, $Y \in R^{d \times n}$ with $d < m$ . Moreover, the transformation separates the different manifolds farther apart under the constraint of local structure preservation.

3.1. Introduction to Sparse Coding

Sparse coding selects as few basis vectors as possible from an overcomplete dictionary to represent an image signal under a reconstruction error constraint. Intuitively, the sparsity of the coding coefficients can be measured by the l0-norm, which counts the number of nonzero entries in a vector or matrix. Since minimization with l0-norm regularization is an NP-hard problem, l1-norm regularization is widely employed in sparse coding instead, as it has been shown that l0-norm and l1-norm regularization are equivalent under certain conditions [41].

In detail, let $Y = [y_{1}, y_{2}, \dots, y_{k}] \in R^{m \times k}$ be a set of m-dimensional samples, and let $D = [d_{1}, d_{2}, \dots, d_{n}] \in R^{m \times n}$ be the dictionary with n entries. The corresponding sparse codes $X = [x_{1}, x_{2}, \dots, x_{k}] \in R^{n \times k}$ over dictionary D can be computed by solving the following optimization objective:

\begin{array}{l} \min_{X} \sum_{i = 1}^{k} \left( {∥ y_{i} - D x_{i} ∥}^{2} + λ {∥ x_{i} ∥}_{1} \right) \\ = \min_{X} \sum_{i = 1}^{k} \left( x_{i}^{T} D^{T} D x_{i} - 2 y_{i}^{T} D x_{i} + λ {∥ x_{i} ∥}_{1} + y_{i}^{T} y_{i} \right), \end{array}

(1)

where the parameter λ balances the reconstruction quality term and the sparsity term. It is well known that $l_{1}$ regularization induces sparsity and makes the problem tractable [42].
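As an illustration of how objective (1) can be minimized for a single sample, the sketch below uses the iterative shrinkage-thresholding algorithm (ISTA). ISTA is swapped in here only because it is short; it is not one of the solvers named in this paper (BP, FSS). The function names, the random dictionary, and the value of `lam` are assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (element-wise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def objective(y, D, x, lam):
    """Value of ||y - D x||^2 + lam * ||x||_1 for one sample."""
    r = y - D @ x
    return float(r @ r + lam * np.abs(x).sum())

def ista_sparse_code(y, D, lam, n_iter=200):
    """Approximately minimise the single-sample objective in (1) with ISTA."""
    # The gradient of ||y - D x||^2 is Lipschitz with constant 2 * sigma_max(D)^2.
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ x - y)
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Toy data: an over-complete dictionary with unit-norm columns (assumed sizes).
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = D @ x_true
lam = 0.1
x_hat = ista_sparse_code(y, D, lam)
```

With a step size of 1/L the objective never increases from one iteration to the next, so `x_hat` is at least as good as the all-zero starting code; larger values of `lam` push more coefficients to exactly zero.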

3.2. Diversification Video Retrieval and Representation

We adopt a diversity restraint as a discriminant constraint among different categories. Diversity constraint terms between categories have already been applied successfully in subspace learning methods, where they improve the criterion for feature mapping and increase the separability of the categories. Applied to sparse coding, the constraint can be expressed as $\sum_{i, j} {∥ v_{i} - v_{j} ∥}^{2} h_{i, j}$ , wherein $v_{i}$ , $v_{j}$ are the codes of features $x_{i}$ , $x_{j}$ , respectively, and $h_{i, j}$ carries the category relation: its value is 1 when $x_{i}$ , $x_{j}$ belong to the same category and 0 or −1 otherwise. Meanwhile, a Laplacian constraint describes the dependence among local features well and also decreases the sensitivity of the sparse code to local noise. Hence, the constraint term $\sum_{i, j} {∥ v_{i} - v_{j} ∥}^{2} l_{i, j}$ is added to the sparse coding objective function to maintain the local structure, in which $l_{i, j}$ is the connection weight between the sample points $x_{i}$ , $x_{j}$ ; the neighborhood relationship among samples is usually defined by K nearest neighbors (KNN). Because $v_{i}, v_{j}$ are histogram statistics, the K-neighbor relationship among codes is defined with the more effective histogram intersection measure rather than the Euclidean distance. The constrained sparse coding objective function may be described as

\begin{array}{l} \min_{U, V} {∥ X - U V ∥}_{F}^{2} + λ \sum_{i} {∥ v_{i} ∥}_{1} \\ + μ \sum_{i, j} {∥ v_{i} - v_{j} ∥}^{2} h_{i, j} + β \sum_{i, j} {∥ v_{i} - v_{j} ∥}^{2} l_{i, j} . \end{array}

(2)

Here, U is the complete “dictionary,” $u_{m}$ is the mth column vector of U, V is the sparse representation matrix, and $v_{i}$ , $v_{j}$ are column vectors of V. The problem is not convex when U and V are optimized at the same time; however, if one of them is fixed, optimizing the other is a convex problem. Therefore, U and V can be learned by alternating optimization; the design is specified as follows.

(1) Defining the template feature set: when the number of local features is large, efficiency is improved by randomly sampling some features as a template feature set $X_{t}$ and solving the sparse representation coefficient matrix $V_{t}$ of these template features.

(2) Solving the constrained sparse representation coefficients: for a feature x, we first compute the corresponding vectors $h_{i}$ and $l_{i}$ , which come, respectively, from the category relation between x and each template feature and from its neighborhood relationship with the template features. If x and $x_{i}$ belong to the same category, $h_{i}$ equals 1, otherwise 0 or −1; $l_{i}$ is the connection weight between the sample x and $x_{i}$ , which is defined by the K nearest neighbor method based on the histogram intersection measure. The constrained sparse representation coefficient of feature x is then obtained by optimizing the following objective function:

\begin{matrix} \min_{v} {∥ x - U v ∥}_{F}^{2} + λ {∥ v ∥}_{1} + μ \sum_{i} {∥ v - v_{i} ∥}^{2} h_{i} + β \sum_{i} {∥ v - v_{i} ∥}^{2} l_{i}, \end{matrix}

(3)

wherein $v_{i} \in V_{t}$ denotes the sparse representation coefficient of $x_{i}$ . The function above can be solved by basis pursuit (BP) or feature sign search (FSS). With the BP method, the discriminant constraint and the Laplacian constraint are treated as new additional constraint conditions in the linear program; with the FSS method, the change in the objective function requires corresponding modifications to the computation of the gradient of the objective function and to the quadratic programming problem solved after the signs of the sparse coefficients are fixed.

Obviously, when the numbers of local features and template features are large, the additional Laplacian constraint terms in objective function (2) are sparse, because the neighborhood region, or the value of K, is small; the discriminant constraint terms, in contrast, grow with the number of local features and are nonsparse. For the BP method, the more constraint terms there are, the harder it is for the linear program to converge; for the FSS method, nonsparse discriminant constraint terms directly and greatly increase the cost of computing the gradient and the complexity of the matrix in the quadratic program. Thus, in order to reduce the computational complexity, the following objective function is adopted:

\begin{matrix} \min_{v} {∥ x - U v ∥}_{F}^{2} + λ {∥ v ∥}_{1} + γ \sum_{i} {∥ v - v_{i} ∥}^{2} l_{i} h_{i}, \end{matrix}

(4)

to approximate objective function (2). Here $\sum_{i} {∥ v - v_{i} ∥}^{2} l_{i} h_{i}$ is defined as the Laplacian discriminant constraint, which imposes the discriminant constraint only on the sparse representation coefficients of neighboring features. It approximates the discriminant constraint term in objective function (2), keeps the discriminant constraint sparse, and simplifies learning, while still constraining the structural relationship of the local features.

(3) Discriminant “dictionary” learning: assume the “dictionary” is $U = [U^{+}, U^{-}]$ . The “basis” vectors of the positive dictionary $U^{+}$ tend to represent the positive features, whereas those of the negative dictionary $U^{-}$ tend to represent the negative features; that is, discriminant learning requires the “basis” vectors in the “dictionary” to be correlated with the categories. Hence, the positive features $X^{+}$ and the negative features $X^{-}$ can be used to learn $U^{+}$ and $U^{-}$ , respectively. When V is fixed, the “dictionary” is updated by optimizing the following objective functions:

\begin{matrix} \min_{U^{+}} {∥ X^{+} - U^{+} {\bar{V}}^{+} ∥}_{F}^{2}, \quad \text{s.t. } {∥ u_{m}^{+} ∥}^{2} \leq 1, \\ \min_{U^{-}} {∥ X^{-} - U^{-} {\bar{V}}^{-} ∥}_{F}^{2}, \quad \text{s.t. } {∥ u_{m}^{-} ∥}^{2} \leq 1 . \end{matrix}

(5)

The optimization problems in (5) can be solved by the conjugate gradient method, the Lagrange dual method, or K-SVD. Because the “dictionary” is split, the matrices ${\bar{V}}^{+}$ and ${\bar{V}}^{-}$ in (5) must match the number of “basis” vectors in the corresponding “dictionary” subsets; they are therefore taken to be submatrices of $V^{+}$ and $V^{-}$ and are used in place of $V^{+}$ and $V^{-}$ . After this replacement, the “loss” of the representation should be as small as possible, so the key row vectors are extracted from $V^{+}$ and $V^{-}$ to form ${\bar{V}}^{+}$ and ${\bar{V}}^{-}$ ; the process is as follows.

(a) Represent V as $V = [V^{+}, V^{-}]$ , where $V^{+} = (v_{i j}^{+})$ and $V^{-} = (v_{i j}^{-})$ denote the sparse representation coefficient matrices of the positive feature set $X^{+}$ and the negative feature set $X^{-}$ , respectively, and then calculate

\begin{matrix} a = {[a_{1}^{+} - a_{1}^{-}, a_{2}^{+} - a_{2}^{-}, \dots, a_{k}^{+} - a_{k}^{-}]}^{T}, \end{matrix}

(6)

where $a_{i}^{+} = \sum_{j} v_{i j}^{+}$ and $a_{i}^{-} = \sum_{j} v_{i j}^{-}$ are the usage rates with which the positive and negative features, respectively, use the $i$ th “basis” in the “dictionary.” Apparently, the larger the factor $a_{i} = a_{i}^{+} - a_{i}^{-}$ is, the more the $i$ th “basis” tends to represent the positive features and the greater the weight its sparse representation coefficients take in the positive feature representation, so the corresponding row should be chosen into ${\bar{V}}^{+}$ . Conversely, the smaller $a_{i}$ is, the more the $i$ th “basis” tends to represent the negative features and the greater the weight its coefficients take in the negative feature representation, so the corresponding row should be chosen into ${\bar{V}}^{-}$ .

(b) Write V in terms of its row vectors as $V = {[v_{1}, v_{2}, \dots, v_{k}]}^{T}$ and rerank the rows of V in descending order of the corresponding factor $a_{i}$ in the vector $a$ , which gives $\tilde{V} = {[{\tilde{v}}_{1}, {\tilde{v}}_{2}, \dots, {\tilde{v}}_{k}]}^{T}$ .

(c) Splitting $\tilde{V}$ at different rows yields different $V^{+}$ and $V^{-}$ , so the optimal splitting row must be found. For each candidate splitting position, ${\dddot{V}}^{+}$ and ${\dddot{V}}^{-}$ are set to 0 in $\tilde{V}$ , and the between-category distance of the positive and negative sparse representation coefficients in $\tilde{V}$ , or the value $d_{i}$ of the Fisher criterion function (with the histogram intersection measure), is then calculated. As the splitting position changes, $d_{i}$ changes accordingly, and the row at which $d_{i}$ attains its maximum is the optimal splitting row.

Steps (2) and (3) above are alternated. The parameters λ and γ are the weights of the sparse constraint and of the Laplacian discriminant constraint, respectively; they can be tuned by cross validation with respect to the area under the curve (AUC), equal error rate (EER), ROC curve, or average precision (AP). These indexes can also be used to evaluate the corresponding annotation and retrieval results.
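Two pieces of the procedure above can be sketched numerically. First, when the weights $w_{i} = l_{i} h_{i}$ are nonnegative (an assumption that restricts $h_{i}$ to 0 or 1), the Laplacian discriminant term in (4) equals, up to a constant, a single quadratic penalty around the weighted mean of the neighboring template codes, so (4) can be handed to any ordinary l1 solver on an enlarged system. This is an algebraic observation for illustration, not a procedure stated in this paper. Second, the usage vector of (6) and the row order of step (b) are computed directly. All function names and toy values are hypothetical.

```python
import numpy as np

def smooth_terms(x, U, v, Vt, w, gamma):
    """Reconstruction and Laplacian discriminant terms of (4), without the l1 term."""
    r = x - U @ v
    diffs = v[:, None] - Vt                      # column i holds v - v_i
    return float(r @ r + gamma * np.sum(w * np.sum(diffs ** 2, axis=0)))

def augment(x, U, Vt, w, gamma):
    """Fold the constraint into a larger least-squares system.

    Assumes w_i = l_i * h_i >= 0 with sum(w) > 0 (h_i restricted to {0, 1}).
    """
    W = w.sum()
    v_bar = Vt @ w / W                           # weighted mean of the template codes
    s = np.sqrt(gamma * W)
    U_aug = np.vstack([U, s * np.eye(U.shape[1])])
    x_aug = np.concatenate([x, s * v_bar])
    return x_aug, U_aug

def usage_order(V_pos, V_neg):
    """Usage vector a of (6) and the descending row order used in step (b)."""
    a = V_pos.sum(axis=1) - V_neg.sum(axis=1)
    order = np.argsort(-a, kind="stable")
    return a, order

rng = np.random.default_rng(0)
U = rng.standard_normal((6, 4))                  # 4 basis vectors of dimension 6
Vt = rng.standard_normal((4, 5))                 # codes of 5 template features
w = np.array([1.0, 0.0, 1.0, 1.0, 0.0])          # l_i * h_i for each template
x = rng.standard_normal(6)
gamma = 0.5
x_aug, U_aug = augment(x, U, Vt, w, gamma)

def gap(v):
    """Difference between the original and the augmented smooth objectives."""
    return smooth_terms(x, U, v, Vt, w, gamma) - float(np.sum((x_aug - U_aug @ v) ** 2))

v1, v2 = rng.standard_normal(4), rng.standard_normal(4)

a, order = usage_order(np.array([[1.0, 1.0], [0.0, 0.0], [0.5, 0.0]]),
                       np.array([[0.0, 0.0], [2.0, 1.0], [0.5, 0.0]]))
```

Because `gap(v)` takes the same value for every `v`, minimizing the augmented objective plus the l1 term gives the same code as minimizing (4). If some $h_{i}$ is −1 the weights can be negative and this reduction no longer applies.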

3.3. Diversify Video Key Frame Representation Based on Local Structure Constraint Sparse Coding

Usually, the training video samples lack “subconcept” tags; hence, diversified annotation and retrieval are processes of unsupervised learning and classification. During sparse coding, however, constraining the local topology structure of the negative “concept” interferes with maintaining the positive local topology structure and thus decreases the distinctiveness of the “subconcepts” within the positive category. It is therefore reasonable to “neglect” the negative local constraint in order to strengthen the preservation of the local topology structure of the “subconcepts,” so that the sparse coding coefficients of neighboring samples become more similar while those of nonneighboring samples become more diverse; this improves the separability of the “subconcepts” and reduces the sensitivity to noise of the sparse coding within a “subconcept” neighborhood. Concretely, only the neighborhood relation between the samples and the positive templates is constrained:

\begin{matrix} \min_{v} {∥ x - U v ∥}_{F}^{2} + λ {∥ v ∥}_{1} + β \sum_{i} {∥ v - v_{i} ∥}^{2} l_{i} c_{i}, \end{matrix}

(7)

wherein $c_{i}$ represents the category tag of the template feature $x_{i}$ to which $v_{i}$ corresponds: 1 for positive and 0 for negative. Similarly, the sparse coding coefficient v of feature x can be solved with the BP or FSS algorithm; the “dictionary” U is then learned directly with the conjugate gradient method, the Lagrange dual method, or K-SVD by optimizing the following objective function:

\begin{matrix} \min_{U} {∥ X - U V ∥}_{F}^{2} . \end{matrix}

(8)
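For a fixed code matrix V, objective (8) is an ordinary least-squares problem in U. The sketch below solves it in closed form, in the spirit of the method of optimal directions (MOD); MOD is swapped in here for brevity and is not one of the three solvers listed for this step. The optional column projection borrows the norm constraint of (5) and is likewise an assumption, as are the function names and toy sizes.

```python
import numpy as np

def mod_update(X, V):
    """Least-squares minimiser of ||X - U V||_F^2 over U for fixed V."""
    # Solve V^T U^T = X^T in the least-squares sense.
    U_T, *_ = np.linalg.lstsq(V.T, X.T, rcond=None)
    return U_T.T

def project_columns(U):
    """Scale down any column whose Euclidean norm exceeds 1 (constraint as in (5))."""
    norms = np.linalg.norm(U, axis=0)
    return U / np.maximum(norms, 1.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 30))      # 30 features of dimension 8
V = rng.standard_normal((5, 30))      # their codes over 5 basis vectors
U_ls = mod_update(X, V)
U_other = rng.standard_normal((8, 5))
res_ls = np.linalg.norm(X - U_ls @ V)
res_other = np.linalg.norm(X - U_other @ V)
U_proj = project_columns(U_ls)
```

The unprojected update attains the smallest possible residual for the given V; projecting the columns afterwards may increase the residual slightly, which is one reason iterative constrained solvers are often preferred in practice.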

As mentioned above, $l_{i}$ is the connection weight between the features $x$ and $x_{i}$ , which is defined by their neighborhood relationship, that is, by the K nearest neighbor method. Apparently, the value of K defines a neighborhood relationship that reflects the extent of a “subconcept,” and a suitable K helps to match the “subconcept” distribution optimally. In addition, objective function (7) involves adjustable coefficients, which are tuned by cross validation with respect to the diversified annotation results.
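The weights $l_{i} c_{i}$ in (7) can be illustrated with a small sketch that finds the K nearest templates of a feature under the histogram intersection measure and then keeps only the positive ones. Binary connection weights, the function names, and the toy histograms are assumptions made for this example; the text does not specify how the weight values themselves are set.

```python
import numpy as np

def histogram_intersection(a, b):
    """Similarity of two histograms: the sum of element-wise minima."""
    return float(np.minimum(a, b).sum())

def positive_knn_weights(x, templates, labels, k):
    """Return w with w_i = l_i * c_i for every template feature.

    templates: (T, m) array whose rows are template histograms x_i.
    labels:    (T,) category tags c_i, 1 for positive and 0 for negative.
    l_i is 1 for the k templates most similar to x and 0 otherwise
    (binary weights are an assumption of this sketch).
    """
    templates = np.asarray(templates, dtype=float)
    labels = np.asarray(labels, dtype=float)
    sims = np.minimum(templates, x).sum(axis=1)
    neighbours = np.argsort(-sims, kind="stable")[:k]
    l = np.zeros(len(templates))
    l[neighbours] = 1.0
    return l * labels

x = np.array([0.5, 0.5, 0.0])
templates = np.array([[0.5, 0.5, 0.0],
                      [0.4, 0.6, 0.0],
                      [0.0, 0.0, 1.0],
                      [0.6, 0.3, 0.1]])
labels = np.array([1, 0, 1, 1])
w = positive_knn_weights(x, templates, labels, k=2)
```

Here the two nearest templates are the first and the second, but the second is negative, so only the first keeps a nonzero weight. A larger K widens the neighborhood and therefore the extent of the “subconcept” that the constraint tries to preserve.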

3.4. Sparse Coding in Kernel Space

Sparse coding is mapped into a kernel space in order to obtain better video retrieval results. Assume that the function φ satisfies $φ {(x)}^{T} φ (y) = κ (x, y)$ ; it maps the features and the “basis” to a higher-dimensional space: $x \to φ (x)$ , $U = [u_{1}, u_{2}, \dots, u_{k}] \to U = [φ (u_{1}), φ (u_{2}), \dots, φ (u_{k})]$ , where $k$ is the number of “basis” vectors in the “dictionary.” Thus, in the higher-dimensional space, the sparse coding objective function with the Laplacian discriminant constraint is

\begin{matrix} \min_{U, v} {∥ φ (x) - U v ∥}_{F}^{2} + λ {∥ v ∥}_{1} + γ \sum_{i} {∥ v - v_{i} ∥}^{2} l_{i} h_{i} . \end{matrix}

(9)

Similarly, the above function can be solved by alternating optimization (functions (3) and (7) are handled in the kernel space by a similar method). When U is fixed and v is solved, the objective function can be rewritten as

\begin{array}{l} \min_{v} {∥ φ (x) - U v ∥}_{F}^{2} + λ {∥ v ∥}_{1} + γ \sum_{i} {∥ v - v_{i} ∥}^{2} l_{i} h_{i} \\ = \min_{v} 1 + v^{T} K_{U U} v - 2 v^{T} K_{U} (x) + λ {∥ v ∥}_{1} + γ \sum_{i} {∥ v - v_{i} ∥}^{2} l_{i} h_{i} . \end{array}

(10)

Here, $K_{U U}$ is a $k \times k$ matrix with ${(K_{U U})}_{i j} = κ (u_{i}, u_{j})$ , and $K_{U} (x)$ is a k-dimensional column vector with ${(K_{U} (x))}_{i} = κ (u_{i}, x)$ ; the constant 1 corresponds to $κ (x, x)$ under the assumption of a normalized kernel, for which $κ (x, x) = 1$ . Compared with sparse coding in the original space, the kernel objective function differs only in the computation of the kernel matrix $K_{U U}$ (corresponding to $U^{T} U$ in the original space) and of the kernel vector $K_{U} (x)$ ; accordingly, in the kernel space the sparse representation coefficient v can be solved directly with the BP or FSS algorithm once $K_{U U}$ and $K_{U} (x)$ have been obtained. When v is fixed and U is solved, a gradient method can be applied to every column of U to approximate the “dictionary” solution in the kernel space by alternating iterations; the gradient is obtained by taking the partial derivative, with respect to each column $u_{m}$ , of

\begin{array}{l} f (U) = \frac{1}{N} \sum_{i = 1}^{N} {∥ φ (x_{i}) - U v_{i} ∥}^{2} \\ = \frac{1}{N} \sum_{i = 1}^{N} \left( 1 + v_{i}^{T} K_{U U} v_{i} - 2 v_{i}^{T} K_{U} (x_{i}) \right) . \end{array}

(11)

In this formulation, $N$ is the number of local features.
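The kernel expansion used in (10) and (11) can be checked numerically. The sketch below evaluates the reconstruction error in the feature space from kernel values only, writing the leading term as $κ (x, x)$ so that it also covers kernels that are not normalized. With a linear kernel, φ is the identity map and the result must match the ordinary squared error; with a Gaussian kernel, $κ (x, x) = 1$ , which gives the constant 1 in (10). The kernel choices, bandwidth, and function names are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel between the columns of A (m, p) and B (m, q)."""
    sq = (A ** 2).sum(axis=0)[:, None] + (B ** 2).sum(axis=0)[None, :] - 2.0 * A.T @ B
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def linear_kernel(A, B):
    """Linear kernel: phi is the identity map."""
    return A.T @ B

def kernel_reconstruction_error(x, U, v, kernel):
    """||phi(x) - sum_m v_m phi(u_m)||^2 computed from kernel evaluations only."""
    x = x.reshape(-1, 1)
    K_UU = kernel(U, U)                 # (k, k) matrix of kappa(u_i, u_j)
    K_Ux = kernel(U, x)[:, 0]           # (k,) vector of kappa(u_i, x)
    k_xx = kernel(x, x)[0, 0]           # equals 1 for a normalised kernel
    return float(k_xx + v @ K_UU @ v - 2.0 * v @ K_Ux)

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 3))
x = rng.standard_normal(5)
v = rng.standard_normal(3)

err_linear = kernel_reconstruction_error(x, U, v, linear_kernel)
err_direct = float(np.sum((x - U @ v) ** 2))
k_xx = gaussian_kernel(x.reshape(-1, 1), x.reshape(-1, 1))[0, 0]
err_gauss = kernel_reconstruction_error(x, U, v, gaussian_kernel)
```

Only `K_UU` and `K_Ux` depend on the dictionary, which is why a solver written for the original space can be reused once these two quantities replace $U^{T} U$ and $U^{T} x$ .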

4. Experiments and Results

4.1. Experimental Datasets

In this section, we compare the proposed method with other methods, namely CMRM [16], PLSA [18], and PLSA-fusion [19]. We select the experimental datasets from CC_WEB_VIDEO [43]; they include videos from five semantic concepts, each of which has many subconcepts, as shown in Table 1. Each concept represents the set of objects sharing the same values for a certain set of properties; each subconcept contains a subset of the objects in the concept above it.

Table 1

Test sample videos.

Topic	Video shots	Subconcepts
Traffic tools	95	7
Person	122	11
Weather	82	8
Animal	79	10
Road	85	10

4.2. Automatic Annotation Comparison

We evaluated the video annotation performance by annotating the test set automatically and then comparing the result with the original labels. The recall was calculated based on how many correct words we extracted compared to the total number of words detected $({Total}_{Recall} = Correct + False)$ :

\begin{matrix} Recall = \frac{R_{a}}{R_{a} + R_{s}} . \end{matrix}

(12)

Table 2 shows the recall rate after the processing step. In addition, Table 3 shows the precision rate of the same processing step.

Table 2

Mean per-word recall.

Topic	CMRM	PLSA	PLSA-fusion	Sparse coding
Traffic tools	0.32	0.51	0.55	0.57
Person	0.35	0.42	0.45	0.52
Weather	0.41	0.47	0.52	0.48
Animal	0.42	0.45	0.49	0.55
Road	0.39	0.47	0.50	0.54

Table 3

Mean per-word precision.

Topic	CMRM	PLSA	PLSA-fusion	Sparse coding
Traffic tools	0.24	0.25	0.26	0.31
Person	0.17	0.24	0.26	0.29
Weather	0.19	0.36	0.35	0.30
Animal	0.16	0.23	0.24	0.29
Road	0.20	0.21	0.22	0.24

The precision was calculated based on how many correct regions we extracted against the total number of actual active regions that our system should have detected $({Total}_{Precision} = Correct + Miss)$ based on the ground truth.

4.3. Comparison of Semantic Search Results

One problem is that diversified annotation is usually achieved by a clustering algorithm; cross validation or result evaluation could therefore refer to a clustering index such as the Davies-Bouldin or Dunn index. However, a clustering index does not include a relevance evaluation and thus fails to reflect the influence of diversified learning on correlation. This work therefore combines the correlation and diversification evaluations and proposes the Maximal Scatter Relevance (MSR) index for choosing the coefficients:

\begin{matrix} {MSR}^{(n)} = \frac{1}{n} \sum_{i = 1}^{n} d (I_{i}, \bar{I}) \cdot r (I_{i}), \quad \bar{I} = \frac{\sum_{i = 1}^{n} I_{i} \cdot r (I_{i})}{\sum_{i = 1}^{n} r (I_{i})} . \end{matrix}

(13)

Here $n$ is the number of estimation samples; $r (I_{i})$ is the relevance valuation of video $I_{i}$, which equals 1 if the video correlates with the “concept” and 0 otherwise, guaranteeing the correlation; and $d (I_{i}, \bar{I})$ is the distance between video $I_{i}$ and the mean $\bar{I}$ of the “concept”-relevant videos, which represents the diversification between samples. Unlike the clustering indices mentioned above, the MSR index accounts for the correlation and diversification valuations at the same time. Compared with the existing Maximal Marginal Relevance (MMR) index, its representation of correlation and diversification is more finely graded, which improves diversification once correlation is ensured. Moreover, the MSR index contains no variable coefficient, which makes it stable. Table 4 compares the ranked retrieval results using MSR.
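Equation (13) can be illustrated with a short sketch. It assumes that each video $I_{i}$ is represented by a numeric feature vector and that $d$ is the Euclidean distance; the paper does not fix either choice here, and the function name `msr` is hypothetical.

```python
import math


def msr(features, relevance):
    """Maximal Scatter Relevance of Eq. (13).

    features  : list of equal-length feature vectors, one per video I_i
    relevance : list of r(I_i) values, 1 if relevant to the concept, else 0
    Assumes d is the Euclidean distance (an assumption of this sketch).
    """
    n = len(features)
    if n == 0 or n != len(relevance):
        raise ValueError("features and relevance must be non-empty and aligned")
    total_r = sum(relevance)
    if total_r == 0:
        return 0.0  # no relevant video, so every term of the sum vanishes
    dim = len(features[0])
    # Relevance-weighted mean of the videos (the second part of Eq. (13)).
    mean = [
        sum(f[k] * r for f, r in zip(features, relevance)) / total_r
        for k in range(dim)
    ]
    # Average of distance-to-mean, counted only for relevant videos.
    return sum(r * math.dist(f, mean) for f, r in zip(features, relevance)) / n


# Two relevant videos and one irrelevant outlier (illustrative data only).
videos = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
labels = [1, 1, 0]
print(msr(videos, labels))
```

Because irrelevant videos have $r (I_{i}) = 0$, the outlier contributes neither to the mean nor to the scatter, while it still counts in $n$; a result set therefore scores higher only when its relevant videos are spread out.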

Table 4

Comparison of ranked retrieval results.

Method	Mean average precision
CMRM	0.205
PLSA	0.275
PLSA-fusion	0.292
Sparse coding	0.314
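
Table 4 reports mean average precision, but the section does not spell out how it is computed. The following is a minimal sketch of the usual definition, assuming binary relevance judgments per ranked result and normalisation by the number of relevant items that appear in the ranked list. Both are assumptions of this sketch rather than details stated in the paper, and the function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list of 0/1 relevance flags."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of average_precision over a list of ranked lists."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)


# Two hypothetical queries (illustrative only).
print(mean_average_precision([[1, 0, 1], [0, 1]]))
```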

4.4. Discussion

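From the above results, we conclude that the proposed method performs well and surpasses the competing methods in most cases. Sparse coding achieves the best mean per-word recall and precision on four of the five topics (all except Weather) and the highest mean average precision (0.314). The experimental results show that the proposed method improves performance on the video annotation and retrieval task, especially in mean per-word precision.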

5. Conclusion

In summary, this paper describes a method for obtaining discriminative sparse representative coefficients, in which the learning of the coefficients and the learning of the dictionary reinforce each other for discrimination, forming a tightly coupled discriminative sparse coding model. In future work, we will extend our sparse coding to the kernel space to obtain accurate expressions of the spatial orders and video sequences associated with concepts and subconcepts.

Footnotes

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (41101432, 41201378), the Natural Science Foundation Project of Chongqing (cstc2012jjA40014), and the Scientific and Technological Research Program of Chongqing Municipal Education Commission (KJ120526).

References

1. Grassi, Morbidoni, Nucci. A collaborative video annotation system based on semantic web technologies. Cognitive Computation 2012; 4(4): 497-514. doi:10.1007/s12559-012-9172-1
2. Boyce. Beyond topicality: a two stage view of relevance and the retrieval process. Information Processing & Management 1982; 18(3): 105-109. doi:10.1016/0306-4573(82)90033-4
3. Yin. Novelty and topicality in interactive information retrieval. Journal of the American Society for Information Science and Technology 2008; 59(2): 201-215. doi:10.1002/asi.20709
4. Zhang, Hurley. Avoiding monotony: improving the diversity of recommendation lists. Proceedings of the 2nd ACM International Conference on Recommender Systems (RecSys '08), October 2008, New York, NY, USA, 123-130. doi:10.1145/1454008.1454030
5. Halvey, Punitha, Hannah, Villa, Hopfgartner, Goyal, Jose J. M. Diversity, assortment, dissimilarity, variety: a study of diversity measures using low level features for video retrieval. Proceedings of the 31st European Conference on IR Research (ECIR '09), vol. 5478, April 2009, Toulouse, France, 126-137.
6. Harada, Nakayama, Kuniyoshi, Otsu. Image annotation and retrieval for weakly labeled images using conceptual learning. New Generation Computing 2010; 28(3): 277-298. doi:10.1007/s00354-009-0090-z
7. Tang, Zheng Y.-T., Wang, Chua T.-S. Sparse ensemble learning for concept detection. IEEE Transactions on Multimedia 2012; 14(1): 43-54. doi:10.1109/TMM.2011.2168198
8. Zhao W.-L., Ngo C.-W. On the annotation of web videos by efficient near-duplicate search. IEEE Transactions on Multimedia 2010; 12(5): 448-461. doi:10.1109/TMM.2010.2050651
9. Wang J.-J., Chen B.-L., Yang C.-Y. Approximation of algebraic and trigonometric polynomials by feedforward neural networks. Neural Computing and Applications 2012; 21(1): 73-80. doi:10.1007/s00521-011-0617-3
10. Glotin, Razik, Pascale, Paris, Benard. Sparse coding for fast minke whale tracking with Hawaiian bottom mounted hydrophones. Proceedings of the 5th International Workshop on Detection, Classification, Localization & Density Estimation of Marine Mammals Using Passive Acoustics (DCL '11), 2011, Portland, Ore, USA.
11. Xie, Chang S.-F., Divakaran, Sun. Structure analysis of soccer video with hidden Markov models. Proceedings of the IEEE International Conference on Acoustic, Speech, and Signal Processing (ICASSP '02), May 2002, Orlando, Fla, USA, 4096-4099.
12. Liu. Image annotation via graph learning. Pattern Recognition 2009; 42(2): 218-228. doi:10.1016/j.patcog.2008.04.012
13. Wang, Hua X.-S., Hong, Tang, G.-J., Song. Unified video annotation via multigraph learning. IEEE Transactions on Circuits and Systems for Video Technology 2009; 19(5): 733-746. doi:10.1109/TCSVT.2009.2017400
14. Hua X.-S., Zhang H.-J., Zhang. An online-optimized incremental learning framework for video semantic classification. Proceedings of the 12th ACM International Conference on Multimedia (MM '04), October 2004, New York, NY, USA, 320-323.
15. Liu, Zhuang. Tensor-based transductive learning for multimodality video semantic concept detection. IEEE Transactions on Multimedia 2009; 11(5): 868-878. doi:10.1109/TMM.2009.2021724
16. Lavrenko, Manmatha, Jeon. A model for learning the semantics of pictures. Proceedings of the 16th Annual Conference on Neural Information Processing Systems (NIPS '03), 2003, Istanbul, Turkey, 307-315.
17. Liu. MLRank: multi-correlation learning to rank for image annotation. Pattern Recognition 2013; 46(10): 2700-2710. doi:10.1016/j.patcog.2013.03.016
18. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 2001; 42(1-2): 177-196. doi:10.1023/A:1007617005950
19. Z.-X., Shi Z.-P., Z.-Q., Shi Z.-Z. Automatic image annotation by fusing semantic topics. Journal of Software 2011; 22(4): 801-812. doi:10.3724/SP.J.1001.2011.03742
20. Wang, Hua X.-S., Tang, Hong. Beyond distance measurement: constructing neighborhood similarity for video annotation. IEEE Transactions on Multimedia 2009; 11(3): 465-476. doi:10.1109/TMM.2009.2012919
21. Vondrick, Patterson, Ramanan. Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision 2013; 101(1): 184-204.
22. Küçük, Yazıcı. A semi-automatic text-based semantic video annotation system for Turkish facilitating multilingual retrieval. Expert Systems with Applications 2012; 40(9): 3398-3411. doi:10.1016/j.eswa.2012.12.048
23. Zhang H.-G., Guo, Bhanu. Improving bag-of-words scheme for scene categorization. The Journal of China Universities of Posts and Telecommunications 2012; 19(supplement 2): 166-171.
24. Fernandez S. A., Vanrell. Texton theory revisited: a bag-of-words approach to combine textons. Pattern Recognition 2012; 45(12): 4312-4325. doi:10.1016/j.patcog.2012.04.032
25. Zhang, Jia, Chen. Image retrieval with geometry-preserving visual phrases. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '11), June 2011, Providence, RI, USA, 809-816. doi:10.1109/CVPR.2011.5995528
26. Lin, Luo, Chen, Zeng. Representing and recognizing objects with massive local image patches. Pattern Recognition 2012; 45(1): 231-240. doi:10.1016/j.patcog.2011.06.011
27. Sarin, Kameyama. Targeting diversity in photographic retrieval task with commonsense knowledge. Proceedings of the 9th Workshop of the Cross Language Evaluation Forum (CLEF '08), 2008, Aarhus, Denmark.
28. Deselaers, Gass, Dreuw, Ney. Jointly optimising relevance and diversity in image retrieval. Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR '09), July 2009, Island of Santorini, Greece, 296-303. doi:10.1145/1646396.1646443
29. Gribonval. Sparse decomposition of stereo signals with matching pursuit and application to blind separation of more than two sources from a stereo mixture. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), May 2002, Orlando, Fla, USA, 3057-3060.
30. Razik, Glotin, Paris, Olivier. Humpback whale song sparse coding and information theory analysis. Proceedings of the 5th International Workshop on Detection, Classification, Localization & Density Estimation of Marine Mammals Using Passive Acoustics (DCL '11), 2011, Portland, Ore, USA.
31. Sun. Image inpainting by patch propagation using patch sparsity. IEEE Transactions on Image Processing 2010; 19(5): 1153-1165. doi:10.1109/TIP.2010.2042098
32. Gong, Yang, Zhang. Gait identification by sparse representation. Proceedings of the 8th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD '11), July 2011, Shanghai, China, 1719-1723. doi:10.1109/FSKD.2011.6019819
33. Yang, Gong, Huang. Linear spatial pyramid matching using sparse coding for image classification. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR '09), June 2009, Miami, Fla, USA, 1794-1801. doi:10.1109/CVPRW.2009.5206757
34. Amari S.-I., Cichocki, D. W. C., Xie. Underdetermined blind source separation based on sparse representation. IEEE Transactions on Signal Processing 2006; 54(2): 423-437. doi:10.1109/TSP.2005.861743
35. Feng, Song, Yang, Zhang. Sub clustering K-SVD: size variable dictionary learning for sparse representations. Proceedings of the IEEE International Conference on Image Processing (ICIP '09), November 2009, Cairo, Egypt, 2149-2152. doi:10.1109/ICIP.2009.5414328
36. Lee, Battle, Raina, A. Y. Efficient sparse coding algorithms. Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS '07), 2007, British Columbia, Canada, 801-808.
37. Shan, Jiao. New evidences for sparse coding strategy employed in visual neurons: from the image processing and nonlinear approximation viewpoint. Proceedings of the 3rd European Symposium on Artificial Neural Networks (ESANN '05), 2005, Bruges, Belgium, 441-446.
38. Olshausen B. A., Field D. J. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research 1997; 37(23): 3311-3325. doi:10.1016/S0042-6989(97)00169-7
39. Zhu, Fan, Elmagarmid A. K. Hierarchical video content description and summarization using unified semantic and visual similarity. Multimedia Systems 2003; 9(1): 31-53.
40. Huang D.-S., Wang, Liu K.-H. Feature extraction using constrained maximum variance mapping. Pattern Recognition 2008; 41(11): 3287-3294. doi:10.1016/j.patcog.2008.05.014
41. Lin, Chen. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. UILU-ENG-09-2215, University of Illinois at Urbana-Champaign, Urbana, Ill, USA, 2009.
42. Mairal, Bach, Ponce, Sapiro, Zisserman. Non-local sparse models for image restoration. Proceedings of the 12th IEEE International Conference on Computer Vision (ICCV '09), October 2009, Kyoto, Japan, 2272-2279. doi:10.1109/ICCV.2009.5459452
43. Hauptmann A. G., Ngo C.-W. Practical elimination of near-duplicates from web video search. Proceedings of the 15th ACM International Conference on Multimedia (MM '07), September 2007, Augsburg, Germany, 218-227. doi:10.1145/1291233.1291280

An Efficient Method for Automatic Video Annotation and Retrieval in Visual Sensor Networks
