Abstract
Video sensor networks are formed by joining heterogeneous sensor nodes whose video communication is functionally bound to geographical locations. Decomposing a georeferenced video stream expresses the video through a spatial feature set. Although georeferenced video has been studied extensively, the spatial relations underlying a scenario are not well understood, yet they are important for understanding the semantics of georeferenced video and the behavior of its elements. Here we propose a method for mapping georeferenced video sequences to geographical scenes and use context-based random graphs to investigate the semantic knowledge of georeferenced video, leading to a correlation analysis of the moving target elements in the georeferenced video stream. We use the connections between motion elements, both correlation and continuity, to present a dynamic structure in time series that reveals clues to the event development of the video stream. Furthermore, we provide a method for the effective integration of semantic and motion information. The experimental results show that the proposed method offers a better description of georeferenced video elements than existing schemes. In addition, it offers a new way of thinking about the semantic description of georeferenced video scenarios.
1. Introduction
The notion of wireless multimedia sensor networks (WMSNs) is frequently described as the convergence between wireless sensor networks and distributed smart cameras [1]. As a result, an increasing number of video clips is being collected, which has created complex data-handling challenges [2]. Further, some types of video data are naturally tied to geographical locations. For example, video data from traffic monitoring may carry little meaning without its associated location information. Therefore, most potential applications of a WMSN require the sensor network paradigm to support location-based multimedia services while manipulating large-scale data, so as to provide a high quality of experience (QoE). This raises an important question: how can georeferenced multimedia be processed intelligently? Although the question has been extensively addressed theoretically, a method for mapping video sequences to geographical scenes remains to be described. On the other hand, with the growth of geographic information systems (GIS), whose major growth area is the convergence between GIS and multimedia technology, a new paradigm named video-GIS has emerged [3–5]. The major research challenges facing video-GIS are the coding of georeferenced video and the content and types of services that georeferenced video should provide. Further improvement of these processes depends on a deeper understanding of video as well as of the spatial relationships of geographic space. Video-GIS must relate video analysis methods to the real geographical scene, which motivates a georeferenced multimedia intelligent processing method based on context-based random graphs.
Georeferenced video is a fundamental component of video-GIS development, and prior research on georeferenced video technologies and applications has been conducted, most of it making use of video and GPS sensors. In [6, 7], Stefanakis and Peterson and Klamma et al. proposed a unified framework for hypermedia and GIS. Pissinou et al. [8] explored topology and direction in georeferenced video. Hwang et al. and Joo et al. [9, 10] defined metadata for georeferenced video that support interoperability between GIS and video images. In the field of georeferenced video search, Liu et al. [11] presented a sensor-enhanced video annotation system that searches video clips for the appearance of particular objects. Ay et al. proposed the use of the geographical properties of videos [12], while Wang gave a method of time-spatial images to extract basic movement information [13]. Although single media have been studied extensively, their semantics in geographic space are poorly understood. Determining the spatial relationships of video elements is one of the most important operations on georeferenced video. For instance, a moving video element changes its position, shape, size, speed, and attribute values over time; understanding the process and rules by which these attributes change is of great significance for the geographical description of the video.
Many techniques for video event recognition have been proposed. Among model-driven methodologies, which are well established and mature, the most common conceptualization of fusion systems is the SVM model [14, 15]. However, such a methodology not only fails to handle problems such as multiple instances, diversity, and multimodality, but also requires a large number of training samples. Most previous studies have instead used data-driven methods [16], carefully designed to extract clear and distinct semantics from videos [17–21]. In our event recognition application, we observe that some events may share common motion patterns. Beyond pattern discovery itself, data-driven methods have also been applied to social networks [22–25]. These works have shown high accuracy in differentiating videos and extracting their semantics. However, most multimedia applications involve unknown and uncertain content, which makes it extremely difficult to meet the requirements of real-time stream processing.
Previous studies have shown that intelligent multimedia processing is important to the development of video-GIS and have achieved inspiring progress. However, these methods suffer from the classical ensemble-average limitation inherent in the analysis of low-level characteristics. Therefore, the spatial data gathered are sometimes inconclusive and, in part, contradictory. These algorithms usually build or learn a model of the target object first and then use it for tracking, without adapting the model to account for changes in the appearance of the object (e.g., large variations of pose or facial expression) or in the surroundings (e.g., lighting variation). Furthermore, they assume that all images are acquired with a stationary camera. Such an approach, in our view, is prone to performance instability and thus needs to be addressed when building a robust visual tracker.
To overcome these problems, we begin by looking at valid models that are suitable for georeferenced video understanding and behavior analysis. In this paper, we propose a new event recognition framework for consumer videos by leveraging a large amount of video data. Graph structures provide a complex, dynamic, and robust framework for assembling the complex relationships among objects [26], which suits our goal. However, multiple random behaviors are present in real movement, making a deterministic graph structure unsuitable for describing a real video scenario. To circumvent this problem, we adopt the random graph model, which can be seen as a simplified model of the evolution of a communication network [27]. In our research, it substantially simplifies the analysis of the interaction between video objects, revealing new insight into the relationships between objects and their complex interactions. Our analysis focuses on describing the spatial relationships bound to objects in georeferenced video using a random graph grammar, developing a scientific analysis of behavior and structured methods for georeferenced video understanding.
2. Preliminary
Surveillance video data are mostly non-ortho image data, so they do not match up with geographic scene vector data using traditional methods. To solve this problem, this paper adopts a mapping method from video scene imaging data to geographic scene vector data, as shown in Figure 1. Firstly, a virtual viewpoint camera is constructed from the camera's interior and exterior parameters. Secondly, a virtual image of the geographic scene is generated from the geographic scene vector data through model transformation, viewpoint transformation, and pruning, following the computer graphics rendering pipeline; this yields the correspondence between objects in the virtual image and vector objects. Thirdly, image matching based on features invariant to translation, scale, and rotation is used to match the virtual image of the geographic scene with the video image. Finally, the correspondence between the video image and the vector data is established from the correspondence between virtual-image objects and vector objects, accomplishing the mapping of the video scene to objects in the geographic scene.

Process of mapping video scene imaging data to geographic scene vector data based on a virtual viewpoint.
In the following subsections, we introduce several key preliminary steps.
2.1. Selection Algorithm of Multicamera Based on Spatial Correlation and Target Priority
A multicamera surveillance system should not only obtain detection and tracking information for motion elements from each single camera, but also build a coherent dynamic scene description from all the observations. Meanwhile, a motion element may be tracked by several cameras simultaneously, so selecting cameras to track a specific target is particularly important in video sensor networks. Based on spatial correlation [28] and target priority, this paper proposes a multicamera selection algorithm with optimized task allocation that automatically selects cameras according to target priority at each moment.
The algorithm assumes that a camera with no task carries out basic single-camera tracking, which has lower power consumption, and that a high-priority task can preempt a lower-priority one when contention arises. The multicamera selection algorithm is shown in Algorithm 1.
(1) begin
(2)
(3)   Find
(4)   Add corresponding
(5)   for each
(6)     for each
(7)
(8)     end for
(9)
(10)    add
(11)  end for
(12)  return
(13) end
The set of images
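As a minimal sketch of this selection scheme, the greedy, priority-first assignment can be written as follows. The scoring dictionary and the preemption rule are our own illustrative assumptions, since the paper's formal notation is not reproduced above:

```python
def select_cameras(targets, correlation):
    """Greedy multicamera selection sketch: `targets` is a list of
    (target_id, priority) pairs; `correlation[(cam, target)]` is a
    spatial-correlation score. Higher-priority targets are served first
    and may preempt a camera held by a lower-priority target."""
    assignment = {}  # camera -> (target_id, priority)
    for tid, prio in sorted(targets, key=lambda t: -t[1]):
        # rank candidate cameras by spatial correlation with this target
        cams = sorted((c for c, t in correlation if t == tid),
                      key=lambda c: -correlation[(c, tid)])
        for cam in cams:
            held = assignment.get(cam)
            if held is None or held[1] < prio:  # free camera, or preempt
                assignment[cam] = (tid, prio)
                break
    return assignment
```

In this sketch, a camera left unassigned simply continues its low-power single-camera tracking, matching the assumption stated above.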
2.2. Organization of Video and Location Data
We have put forward a coding model for video-GIS that comprises video together with the camera's position, view direction, and viewing distance. The location data can be collected automatically by small sensors attached to the camera, such as a GPS receiver and a compass (see Figure 2). This eliminates manual work and allows the annotation process to be accurate and scalable. We therefore investigate the real-time collection, coding, and integration of video and GPS information on the SEED-VPM642 platform, obtaining two location-based streams at different bit rates: the lower-bit-rate stream is used for live broadcast over the wireless network, and the higher-bit-rate stream is stored to disk.

Experimental hardware and software to acquire georeferenced video.
In the coding of video-GIS, we need to calculate the three-dimensional coordinates of video objects [29]. Because video-GIS coding based on mobile sensors cannot calibrate a single video frame with a three-dimensional control field, the most effective way is to use a digital map and spatial geometric relations (see Figure 3).

Geometry for calibrating multiple sensors.
Therefore, the geometric relationship among the GPS, the posture sensor, the imaging space, and the object space should be established. It is assumed that the axis of imaging space
In detail,
To acquire more precise spatial locating information, we need at least the GPS information and the attitude information generated by a posture sensor. Therefore, the spatial locating information is described by the combination of GPS data and the angular orientation elements (Heading, Pitch, and Roll) obtained by a Micro Inertial Measurement Unit (MIMU), as shown in Table 1.
Sample of GPS and MIMU.
As shown in Table 1, there are two kinds of the spatial locating information:
GPS information, such as UTC time, longitude, and latitude; and angular orientation information, including Heading, Pitch, and Roll.
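As an illustration, the combined locating record can be represented as the following small data structure. The field names and the heading convention (degrees clockwise from north) are assumptions for this sketch; Table 1 defines the actual sample format:

```python
import math
from dataclasses import dataclass

@dataclass
class LocatingInfo:
    """Combined spatial locating record from GPS and MIMU (illustrative)."""
    utc: str          # GPS UTC time
    lat: float        # latitude, degrees
    lon: float        # longitude, degrees
    heading: float    # MIMU heading, degrees (0 = north, clockwise)
    pitch: float      # MIMU pitch, degrees
    roll: float       # MIMU roll, degrees

    def view_direction(self):
        """Horizontal unit view vector (east, north) from heading and pitch."""
        h, p = math.radians(self.heading), math.radians(self.pitch)
        return (math.sin(h) * math.cos(p), math.cos(h) * math.cos(p))
```

For example, a record with heading 90° and zero pitch yields a view direction pointing due east.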
2.3. Digital Map-Based Image Resolution
Digital map features are expressed as vector data in the vertical projection onto a two-dimensional plane. From the standpoint of this work, the video image is a raster expression of a feature that also carries information in the height direction, and after vectorization the video image can likewise be expressed in a point-line-surface data format. The correspondence between video images and the digital map for points, lines, and surfaces is shown in Table 2.
Correspondence between video images and digital map.
From a technical point of view, we treat map-based image resolution as a three-dimensional measurement problem and use matching between single video frames and the digital map, defined in three steps. The first step is feature extraction from the dense range image, which aims to extract point and line features. Under the premise of full calibration of the video frame, we can identify particular characteristics of the extracted target to meet special requirements; for instance, the corner of a building or a telegraph pole serves as a fixed line feature that appears as a vertical line in the video image. The second step combines the line features into surface features using texture information. The third step matches these features with the digital map vector data; the matching covers point-to-point, point-to-line, line-to-line, and line-to-plane correspondences, as shown in Figure 4.

Mapping from Image to Digital Map.
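A much-simplified sketch of the point-to-point matching case follows. The full matcher described above also covers line and surface features and uses translation/scale/rotation-invariant descriptors; the plain nearest-neighbour rule and the distance threshold here are assumptions for illustration only:

```python
import math

def match_points(image_points, map_points, max_dist):
    """Match projected image feature points to digital-map vector points
    by nearest neighbour in a shared coordinate frame (point-to-point
    matching only; a simplified illustration)."""
    matches = []
    for p in image_points:
        best, best_d = None, max_dist
        for q in map_points:
            d = math.hypot(p[0] - q[0], p[1] - q[1])
            if d < best_d:
                best, best_d = q, d
        if best is not None:
            matches.append((p, best))
    return matches
```

Points with no map feature within `max_dist` are simply left unmatched, which stands in for the pruning that a real matcher would perform.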
3. Syntactic Structure
3.1. Syntactic Description of Motion Element
A video motion element mainly refers to an entity object that can be clearly identified visually and is morphologically important, such as a pedestrian in surveillance video. Current description methods for motion elements are mainly based on color and texture, which makes it difficult to support the definition of motion elements, behavior analysis, and behavior understanding. For a better description of the dynamic characteristics of video motion elements, we first define some related concepts.
Definition 1.
State. A state is an abstraction of the attributes owned by a motion element; it is a static description of the condition and activity of a motion element at a certain time.
(a) Appear. A newly emerging motion element, distinguished from the existing ones within the specific geographical boundary, is in the Appear state; the motion element then starts to be detected and tracked. An Appear instance is regarded as the first instance of the motion element.
(b) Disappear. In contrast with Appear, Disappear is the state of vanishing from the specific geographical boundary, or of being untraceable within a specific time, and is viewed as the last instance in the state description. The Disappear state signals the cancellation of detection and tracking for the motion element.
(c) Stop. Stop S is defined on the triple

The definition of Stop.
(d) Move. Within the scope of the spatial constraint, Move M is a general designation for connecting the other three basic states in a continuous motion process of a motion element. An instance of Move can be represented as
Definition 2.
Behavior Attribute. The behavior description of a single typical motion element mainly includes spatial location and speed. Spatial location can be defined as
Definition 3.
Relation. A Relation is an incidence relation of mutual influence between two motion elements within the same temporal subspace T.
Definition 4.
Spatial Relation. Spatial Relation includes metric relations, direction relations, and topological relations. Spatial Relation
Definition 5.
Visual Feature. In the georeferenced video stream, the visual characteristics of a motion element, including color, texture, and shape, change dynamically with time T. Therefore, the changes of a motion element's visual characteristics within the scope of the spatial constraint should be described accurately [30]. The visual characteristics mainly include Color, Texture, Shape, and Size. Texture reflects the structural pattern and gray-level spatial distribution formed by local pixels of the motion element, while these low-level features clearly define and describe the motion element.
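Definitions 1–5 can be collected into a small data model. The sketch below is illustrative only: the attribute names, and the numeric coding of the four states, are our assumptions rather than the paper's formal notation:

```python
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):
    """The four basic behavior states of a motion element (Definition 1)."""
    APPEAR = 0
    DISAPPEAR = 1
    STOP = 2
    MOVE = 3

@dataclass
class MotionElement:
    """A tracked motion element with behavior attributes (Definition 2)
    and a behavior state sequence; visual features (Definition 5) are
    reduced to a plain dict here."""
    elem_id: str
    state: State = State.APPEAR
    position: tuple = (0.0, 0.0)      # object-space spatial location
    speed: tuple = (0.0, 0.0)         # velocity vector
    visual: dict = field(default_factory=dict)   # Color, Texture, Shape, Size
    history: list = field(default_factory=list)  # behavior state sequence

    def transition(self, new_state: State):
        """Record the current state and move to the next one."""
        self.history.append(self.state)
        self.state = new_state
```

An element thus begins in Appear, accumulates a state sequence through `transition`, and ends in Disappear when tracking is cancelled.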
3.2. Behavior and Interaction of Motion Element
In the georeferenced video stream, the Behavior of a motion element within the specific scope of the spatial constraint is represented by its behavior state sequence, as shown in Figure 6. Let the state set of Behavior be BehaviorState, with typical element τ, defined as follows:

Behavior state sequence of motion element.
As one of the expression forms of motion elements in the video stream, Interaction represents the mutual influence or joint action arising from the Relation between two behavior state instances. The necessary condition for establishing an interaction relationship is that an incidence relation exists between the two behavior state instances at the same time. It can be defined as a five-tuple
Under the influence of temporal subspace T and spatial relation
Due to the close correlation of spatial relation at any time point
Meanwhile, the measured value P of the interaction between two motion elements that have established a Relation can be computed from the planar spatial distance Distance, the velocity magnitude, and the direction angle, including the current topology at time point

A diagram of interaction relation.
In the georeferenced video stream, the dynamic update function of interaction relation within the scope of spatial constraint is shown as follows:
Among them,
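Although the update function itself is not reproduced above, an illustrative interaction measure combining the same quantities (planar distance, velocity magnitudes, and direction angle) can be sketched as follows. The decay-with-distance form and the weighting are assumptions, not the paper's formula:

```python
import math

def interaction_measure(pos_a, pos_b, vel_a, vel_b, d_max):
    """Illustrative interaction measure P between two motion elements:
    decays with planar distance and grows when the elements move in
    similar directions; zero outside the spatial constraint d_max."""
    d = math.hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1])
    if d >= d_max:
        return 0.0                       # outside the spatial constraint
    proximity = 1.0 - d / d_max
    sa, sb = math.hypot(*vel_a), math.hypot(*vel_b)
    if sa == 0 or sb == 0:
        align = 0.5                      # neutral when one element is still
    else:
        cos_angle = (vel_a[0]*vel_b[0] + vel_a[1]*vel_b[1]) / (sa * sb)
        align = 0.5 * (1.0 + cos_angle)  # 1 = same direction, 0 = opposite
    return proximity * align
```

Two nearby elements moving in parallel thus score close to 1, while distant or opposing elements score near 0, matching the qualitative behavior described above.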
4. Semantics and Formalization of Georeferenced Video
For the accurate description and behavior understanding of motion elements in the georeferenced video stream, this paper proposes an analysis method based on sparse random graphs, with the purpose of observing how their characteristics evolve over time, and presents an indicating and measuring method for video motion elements with dynamic topological structure information based on a context-sensitive sparse random graph grammar.
4.1. Formalization of Georeferenced Video
Random graph
Each edge of the random graph G is mutually independent; namely, any two vertexes that have established an incidence relation are connected independently with probability P. As spatial relations change dynamically during movement over time, it is necessary to describe the motion states and interaction relationships within a specific spatial area using a random graph. The context-sensitive sparse random graph grammar can be defined as a five-tuple
Among them, S is the root vertex, the initial vertex of a semantic event in the georeferenced video stream; there is only one S vertex in a video event sequence. Vertex
The motion element vertex of random graph can be defined as follows:
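Independently of the formal grammar (whose definition is omitted above), the underlying sparse random graph construction follows the classical G(n, p) model, which can be sketched as:

```python
import random

def sparse_random_graph(n, p, seed=None):
    """Classical G(n, p) random graph: each of the n*(n-1)/2 possible
    edges is included independently with probability p; the graph is
    sparse when p is small relative to 1/n."""
    rng = random.Random(seed)
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < p}
    return {"vertices": list(range(n)), "edges": edges}
```

With p = 1 every pair is connected, and with p = 0 no pair is; intermediate p values yield the sparse graphs used throughout this section.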
4.2. Evolution Rule
As a posterior method, the dynamic process of motion elements in the video stream can be visually described and displayed based on the sparse random graph. The temporal and spatial evolution model of motion elements is able to describe the basic characteristics and dynamic process of spatial relations accurately. In essence, the dynamic evolution of the sparse random graph is a continuous transition process of its state space.
Therefore, the state transition function of sparse random graph can be defined as a mapping relation
Among them,
The dynamic evolution process of sparse random graph includes its character update of motion element vertex
Input: sparse random graph detection and recognition information
Output: return
(1) IF
(2)   Create first node S & Add S to
(3) End IF
(4) While
(5)   IF
(6)     Find nearest node
(7)     Create new edge E(
(8)     Add
(9)   End IF
(10)  For
(11)    IF
(12)      Remove
(13)      Delete edge of
(14)    End IF
(15)    Update
(16)  End For
(17)  For
(18)    IF Flag← getRestriction(
(19)      Delete edge of
(20)    Else IF Flag← getAttract(
(21)      Add new edge of
(22)    End IF
(23)  End For
(24)  For
(25)    IF
(26)
(27)    Else Update other P of
(28)    End IF
(29)  End For
(30) Return
We can obtain the corresponding dynamic evolution model of the sparse random graph using the evolution rule algorithm. Step (2) of the algorithm creates and adds the root vertex S, and
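A simplified sketch of one evolution step follows: a vertex is added for each newly detected element and linked to its nearest existing vertex, and vertices whose elements have disappeared are removed together with their incident edges. The data layout (`nodes` as an id-to-position dict, edges as frozensets) is an assumption; edge deletion and attraction rules from the listing above are omitted:

```python
import math

def evolve(graph, detections, d_max):
    """One evolution step of the sparse random graph (simplified).
    `graph` is {"nodes": {id: (x, y)}, "edges": set of frozensets};
    `detections` is a list of {"id": ..., "pos": (x, y)} dicts."""
    nodes, edges = graph["nodes"], graph["edges"]
    seen = {d["id"] for d in detections}
    # remove vertices for disappeared elements, with their incident edges
    for vid in [v for v in nodes if v not in seen]:
        del nodes[vid]
        edges.difference_update({e for e in edges if vid in e})
    for det in detections:
        vid, pos = det["id"], det["pos"]
        if vid not in nodes and nodes:
            # connect the new vertex to its nearest existing neighbour
            near = min(nodes, key=lambda v: math.dist(nodes[v], pos))
            if math.dist(nodes[near], pos) < d_max:
                edges.add(frozenset((vid, near)))
        nodes[vid] = pos  # add the vertex, or update its attributes
    return graph
```

Repeating `evolve` over successive frames yields the continuous state transition process described above.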
4.3. Random Subgraph
The cohesion of a random subgraph refers to the closeness of the relations among its motion elements. To measure this closeness, the paper introduces the concept of structural entropy. As a measure of the disorder and randomness of a state, structural entropy is closely related to the compactness of the random subgraph: the higher the compactness, the lower the structural entropy value.
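One common formulation of structural entropy is the Shannon entropy of the normalized degree distribution, sketched below. The paper's exact definition is not reproduced above, so this particular formula is an assumption for illustration:

```python
import math

def structure_entropy(degrees):
    """Shannon entropy (bits) of a subgraph's normalized degree
    distribution, one common definition of structural entropy."""
    total = sum(degrees)
    if total == 0:
        return 0.0
    probs = [d / total for d in degrees if d > 0]
    return -sum(p * math.log2(p) for p in probs)
```

For example, two vertices joined by a single edge (degrees 1 and 1) give an entropy of exactly 1 bit.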
If vertexes
4.4. Early Warning of Video Event
Using the numerical calculation of interaction relationships, abnormal behavior and emergencies in video can be distinguished based on the random graph grammar, and possible special situations can trigger early warnings. A video event generates two different threat levels, notify and alarm, as shown in Figure 8.

Notify and Alarm processing of video event.
This paper mainly detects unexpected crowd incidents and conflicts among massive video events and proposes a novel two-layer discriminant method consisting of an individual attribute layer and a group attribute layer. Once a video abnormal event occurs, the corresponding real-time status of the random graph must be described, which can be expressed as follows.
(1) Individual Attribute Layer. The velocities of multiple random graph nodes change radically per unit time T, and the corresponding movement directions also change significantly.
Specifically, the detection and selection of the variation range or interval of movement attributes in the random graph can use a sliding window. In the continuous movement attribute value
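The sliding-window detection can be sketched as follows; the window size and the range threshold are illustrative assumptions:

```python
def detect_radical_change(values, window, threshold):
    """Sliding-window detection of radical attribute change: flag each
    window position where the value range (max - min) within the window
    exceeds `threshold`. `values` is a time series of one movement
    attribute (e.g., speed magnitude)."""
    flags = []
    for i in range(len(values) - window + 1):
        w = values[i:i + window]
        flags.append(max(w) - min(w) > threshold)
    return flags
```

A window that contains a sudden jump in the attribute value is flagged, while steady segments are not.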
(2) Group Attribute Layer. The multiple interaction and distance values among random graph nodes in a group fluctuate greatly, or the numerical values of interaction relationships in a random subgraph change significantly. The discriminant analysis of a video abnormal event is achieved by checking whether the change rate of the parameter value
Once either circumstance occurs, the system enters the notify phase.
When entering the notify discriminative phase, a numerical calculation is made on the random subgraph showing diffusion or flocking status. Using the structural entropy computation, the status of the corresponding random subgraph is measured, and the entropy value
The warning degree
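A minimal sketch of mapping the entropy value to a discrete warning degree follows. The threshold values, and the assumption that a higher entropy value raises the warning degree, are ours; the paper sets its three warning intervals empirically:

```python
def warning_degree(entropy, t1, t2):
    """Map a subgraph's warning entropy value to a discrete warning
    degree using two thresholds t1 < t2 (illustrative values)."""
    if entropy < t1:
        return "Normal"
    if entropy < t2:
        return "Warning1"   # early warning: an abnormal event may emerge
    return "Warning2"       # alarm: an abnormal event is detected
```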
The discriminant method based on the random graph is called graph-based reasoning (GBR) in this paper, while the improved GBR fused with the traditional CBR method is called GBR-C. Intelligent analysis of different video scenes plays an important role in the real-time detection of abnormal video behaviors and mass incidents. The instantaneous status information of video motion elements is integrated into the random graph model, and the random subgraph patterns and behavior rules are summarized with a statistical description. A behavior that violates the regularity of common video events is a latent exceptional event; the features of the video motion elements involved are extracted and recorded in the object-layer stream for efficient content-based video retrieval.
5. Experiment and Analysis
To verify the feasibility and availability of the proposed framework, the spatial information of motion elements is extracted in real time based on detection and tracking [31, 32]. According to the dynamic changes of spatial semantics, a timing description method using the random graph grammar depicts the event development of the video stream clearly.
5.1. Interaction Description
Interaction is the mutual incidence relation among motion elements. For an accurate description of the dynamic change process of the interaction relation, the interaction P should be calculated in real time based on the spatial information in the experimental video, including the planar spatial distance, velocity magnitude, and direction angle. The calculation results of the real-time interaction update function

Dynamic change process of interaction relation.
In Figure 9, function
The previous results show that the method can accurately depict the dynamic variations of the interaction relations of video motion elements. Such accurate depiction is an indispensable premise for the description of the georeferenced video stream.
5.2. Georeferenced Video Stream Description
Based on the rich spatial semantics of motion elements in the georeferenced video stream, we can realize intelligent parsing of georeferenced video content using the context-sensitive sparse random graph grammar. The spatial relationships of motion elements in image space are transformed into those of object space, and the motion status and interaction relations are depicted using the random graph. The continuous transition process of the inner state space of the random graph is enforced by the dynamic evolution process of the sparse random graph.
With the spatial reference data, the sparse random graph evolution process based on the monitored targets is achieved. The people who successively appear within the video surveillance range are labeled A, B, C, and D, as shown in Figure 10. As soon as a moving object appears, a new random graph node represents it; when it leaves the surveillance area, the corresponding node disappears, and the edge set constituted by the interactions associated with that node is set to null. Using our video test data, the evolution process and the timing evolving description diagram of the video clip trimmed from frame 1041 to frame 1712 are shown in Figure 10.

Timing evolving description diagram of the georeferenced video stream.
We can see that the timing evolving description diagram can be constructed by automatic intelligent analysis and calculation over a video clip, which verifies the correctness and effectiveness of the evolution rule algorithm of the sparse random graph. Within the specific geographical space, the time-varying attributes of random graph nodes, such as behavior state, spatial location, and movement parameters, are visually displayed. The basic recorded information of each video motion element is shown in Algorithm 3.
State="2"                    // Behavior State
frame="612"                  // Current Frame Number
timeDelay="612"              // Duration
PixelX="198" PixelY="211"    // Image Space Coordinate
LoctX="45" LoctY="60"        // Object Space Coordinate
DeltX="0.85" DeltY="0.24"    // Relative Distance
Speed="(0.85, 0.24)"         // Speed
InteractionNum="1"           // Interaction Relationship Number
Interaction="
VF="0" Other="0"/
Among them, the basic information consists of attribute information, spatial location information, and other movement parameters, as shown in Algorithm 3. The attribute State indicates the behavior status of the video motion element with the compact numbers 0, 1, 2, and 3, which correspond respectively to the four basic behavior

Structural description of video motion feature.
The automatically generated file mainly consists of two parts: configuration data and content data. The movement status information about each motion element Object in the georeferenced video stream is described in detail in the content data part, while the basic attribute information about the test video clip is given in the configuration data part. Over a continuous period of the time series, the movement status information of each motion element, including the behavior state sequence, real-time spatial location information, and statistical information about interaction relations, can be queried directly from the XML file. This also provides a simple, novel nonlinear index for the understanding and description of video content.
5.3. Performance of Video Event Warning
To validate the proposed early warning method for abnormal video behavior and emergencies, we analyzed the performance of various attributes using video test data involving a crowd scene. The experimental analysis mainly covers the real-time warning entropy value of the random subgraph, the warning degree, and the real-time changes of the corresponding subgraph node number and total graph node number, shown respectively in Figure 12. The horizontal axis indicates the video running time, with 10 seconds as one scale unit.

Real-time warning entropy value, warning degree, and node numbers for the video test data.
As can be seen from the illustration, the real-time warning entropy value of the random subgraph, computed with the structural entropy method, exhibits random fluctuations in Figure 12(a). According to the warning degree of abnormal video behavior and emergencies, three warning threshold intervals are set in our test, and the Warning2 degree occurs between 252 and 270 seconds, as shown in Figure 12(b). Warning1 indicates the early warning degree most of the time, which means that an abnormal video event may emerge. Figure 12(c) shows the real-time node number of the random subgraph within the video surveillance scope, while Figure 12(d) shows the total graph node number.
5.4. Performance Comparisons of Intelligent Analysis Methods
In this section, we compare the proposed method with other methods, namely, the Coarse-Grained SVM, Fine-Grained SVM [15], and MKL [19]. We carry out the comparison using three sample videos (Table 3) that involve events in which a group of people interact with each other. All the chosen samples are treated as labeled training data within the target domain.
Three test sample videos.
GBR accomplishes a concise numerical calculation and avoids the computational complexity of the traditional CBR method. In Tables 4, 5, and 6, we compare the performance of GBR and GBR-C with the other methods on three different videos.
Comparison of crossing sample A with different methods.
Comparison of flocking sample B with different methods.
Comparison of conflict sample C with different methods.
From Tables 4, 5, and 6, we observe that GBR lengthens the processing time for common video event detection, but the forecasting accuracy for abnormal video behavior and emergencies increases significantly at lower computational complexity. Therefore, the energy consumption of the sensors is reduced, consistent with the transmission costs, especially in nonrecurring flocking emergencies with complex video event modeling.
6. Conclusion
In summary, existing approaches are largely based on low-level visual features, which means they lack spatial constraints and coupled analysis with the geographic environment; it is therefore necessary to establish the relationship between video analysis methods and the real geographical scene. We have proposed a georeferenced video analysis method based on context-based random graphs. The data are obtained using a wireless network of environmental sensors scattered over the supervised area and a vision sensor monitoring the same geographical area. Experimental results show that the proposed description method for georeferenced video using random graphs is feasible and efficient. Through intelligent parsing of the georeferenced video data stream, we obtain a novel visual description method using random graphs that clearly depicts the development clues of video scenes and also makes it possible to browse the video stream quickly. Meanwhile, the random graph can serve as an effective nonlinear index for content-based video indexing and browsing applications.
As future work, we plan to enhance the implemented algorithms with alternative combination rules and with the fusion of audio and video to deal with the uncertainty, imprecision, and incompleteness of the underlying information. In addition, large-scale experiments should be conducted to set the various parameters, such as thresholds, false alarm rates, and fusion weights.
Acknowledgments
The work is supported by the National Natural Science Foundation of China (41101432 and 41201378), the Natural Science Foundation Project of Chongqing (CSTC 2010BB2416), and the Education Science and Technology Foundation of Chongqing (KJ120526).
