A Missing Sensor Data Estimation Algorithm Based on Temporal and Spatial Correlation

Abstract

In wireless sensor networks, data loss is inevitable because of the networks' inherent characteristics. In some situations the loss is severe, which poses a major challenge for applications that rely on sensor data. Traditional data estimation methods cannot be applied directly to wireless sensor networks, and existing estimation algorithms either fail to provide satisfactory accuracy or have high complexity. To address this problem, this paper proposes the Temporal and Spatial Correlation Algorithm (TSCA), which estimates missing data as accurately as possible. Firstly, it saves all the data sensed at the same time as a time series and selects the most relevant series as the analysis sample, which improves the efficiency and accuracy of the algorithm. Secondly, it estimates missing values in both the temporal and spatial dimensions and assigns different weights to the two dimensions. Thirdly, it provides two strategies for dealing with severe data loss, which broaden the applicability of the algorithm. Simulation results on different sensor datasets show that the proposed approach outperforms existing solutions in terms of estimation accuracy.

1. Introduction

In recent years, with the development of sensing, wireless communication, and computing technologies, the wireless sensor network (WSN) [1] has become a focus of research and has attracted strong attention from the military, industry, and academia. In many WSN applications, data loss [2, 3] is common because of the limited resources of sensor nodes [4], noise interference, and environmental influence. In some special situations the loss is very serious [5], which poses a major challenge for sensor data processing. If the missing values cannot be filled in accurately, existing analysis tools cannot be applied. If records with missing data are simply deleted, a large amount of raw data is lost, which reduces the accuracy and reliability of the analysis results and wastes the energy spent collecting the data. Data estimation algorithms can effectively mitigate this problem, and they provide strong support for query [6], aggregation, transmission, and warning [7]. Missing data estimation is therefore particularly important for WSN applications.

However, traditional data estimation methods [8] cannot be used directly in WSN. Sensor data estimation methods should consider the characteristics of the application system and of the sensor data. Although many studies on sensor data estimation have been conducted and some progress has been made, several issues remain unresolved, such as underutilization of the properties of sensor data, high computational complexity, and low estimation accuracy.

We present a Temporal and Spatial Correlation Algorithm (TSCA) to estimate missing data in this paper. The algorithm has four main innovations. Firstly, it saves all the data sensed at the same time as a time series and selects the most relevant series as the analysis sample, which improves the efficiency and accuracy of the algorithm. Secondly, it selects the most data-relevant sensor nodes and obtains a spatial estimate based on their combined instantaneous rates of change. In the time dimension, it distinguishes past frames by their order when estimating the rate of change, which emphasizes the timeliness of sensor data. Thirdly, different weights are assigned to the temporal and spatial dimensions to obtain the final result. Finally, it provides two strategies for dealing with severe data loss, which broaden the applicability of the algorithm.

The rest of this paper is organized as follows. Section 2 presents the classic estimation algorithms for missing sensor data. Section 3 presents the framework of the algorithm proposed in this paper. Section 4 describes the detailed design of our algorithm and extends it to severe loss scenarios. Section 5 evaluates the proposed approach through simulation experiments. Section 6 concludes this paper.

2. Related Work

Estimation algorithms for missing data have been extensively researched in statistics, for example, Mean Substitution, Imputation by Regression, Expectation Maximization, Maximum Likelihood, Multiple Imputations, Bayesian Estimation, and Hot/Cold Deck Imputation [9]. However, none of these algorithms can be used in WSN, because they require the data to be missing at random and their efficiency is low.

To solve the missing sensor data problem, TinyDB [10], a mainstream sensor database system, directly uses the mean of the data sensed by other nodes as the estimated value. However, when the relationship among the sensor nodes is weak, the estimate is imprecise. The MASTER-M algorithm [11] computes the similarity between sensor nodes and sorts them. It selects nodes with a high missing rate as seeds and clusters the whole network into several groups. A MASTER-tree is used to estimate missing data in each cluster. However, the similarity relationship between sensor nodes is not transitive; for example, $S 1$ and $S 2$ may be similar and $S 2$ and $S 3$ may be similar, while $S 1$ and $S 3$ are not. So in a network of n nodes, $C_{n}^{2}$ calculations and comparisons are needed in each clustering pass. If the similarity relationships between the sensor nodes change rapidly, reclustering is needed constantly, which causes high computational complexity. The Adaptive Multiple Regression (AMR) algorithm is proposed in [12]. Sample data and the most relevant sensor nodes are determined heuristically, and missing values are estimated with linear regression models from the data of the relevant nodes. Because the key steps of this algorithm are realized heuristically, its computational complexity increases. In addition, location-related nodes are not always data-related; for example, in a place with several heat sources [13], nodes that are near heat sources but far apart from each other may be more relevant to one another than to nearby nodes. So location-based association mining is not accurate. Estimation with linear regression models also increases errors. The Grey System Estimate Algorithm (GSEA) [14] estimates missing values based on a grey model. Minimized Similarity Distortion (MSD) [15] uses linear regression to estimate the lost values. The accuracy of both GSEA and MSD is poor.

The above algorithms consider only the temporal or only the spatial correlation; few algorithms take both into account. The Environmental Space Time Improved Compressive Sensing (ESTI-CS) algorithm [16] is based on compressed sensing. It reconstructs the signal by L1 norm optimization, which requires iteration and causes high complexity. Reference [17] proposes the Trend Regression Expanding Cluster Interpolation (TRECI) algorithm, which considers the change of sensor data over time. Sensor nodes are divided into groups dynamically, and temporal interpolation is conducted within each group. In the spatial dimension it only analyzes similarity and does not predict the lost values. The Data Estimation using Statistical Model (DESM) algorithm [18] estimates the missing data in the time dimension based on the propagation characteristics of physical quantities; for example, because light intensity is inversely proportional to the square of the distance, the light intensity in a certain region can be estimated. In the spatial dimension, it estimates missing data based on the correlation between the estimated node and its surrounding nodes. The disadvantage of this algorithm is that it is appropriate only for attributes that have explicit physical models. Besides, the estimation in the spatial dimension is rough. Reference [19] proposes the Mining Autonomously Spatial-Temporal Environmental Rules (MASTER) algorithm, which mines associations among sensor data in the temporal and spatial dimensions. A major drawback of this algorithm is that when the relationship among sensor data is weak, the prediction is very inaccurate.

3. Framework of Proposed Algorithm

Sensor data collected by a node $S i$ can be seen as a time series $S i = [(V_{i 1}, T_{1}), (V_{i 2}, T_{2}), \dots, (V_{i n}, T_{n})]$, where $V_{i k}$ is the value sensed at $T_{k}$. For any time $T_{k}$ ($k = 1, 2, \dots, n$), if the value $V_{i k}$ is lost, the missing data estimation problem is to find an estimated value $V_{i k}^{'}$ that minimizes $|V_{i k}^{'} - V_{i k}|$.

The comparison in [16] of the difference between two consecutive intervals and the difference between neighbors shows that most measured data in the real world change smoothly; that is, environmental values rarely change abruptly between adjacent time slots. In addition, the environment is often smooth within a small area; that is, over a period of time, environmental values are similar among some nodes. Thus, we can use spatiotemporal correlations to estimate the missing data.

Because existing missing data estimation algorithms do not make full use of the features of sensor data and suffer from high computational complexity and low accuracy, this paper proposes a missing data estimation algorithm based on temporal and spatial correlations, as shown in Figure 1. The result of this algorithm, Estimate, is computed by the following formula:

\begin{matrix} Estimate = \sum_{i = 1}^{s_{n}} w_{i} * V_{Spatial} + \left(1 - \sum_{i = 1}^{s_{n}} w_{i}\right) * V_{Temple}, \end{matrix}

(1)

where $V_{Spatial}$ and $V_{Temple}$ are the analysis results of spatial and temporal correlations, $w_{i}$ is the weight of each relevant sensor node, and $s_{n}$ is the number of sensor nodes used to estimate the missing data.

Figure 1: Framework of the algorithm in this paper.

This algorithm consists of three parts:

(i) Firstly, the algorithm determines the sample data used in the analysis. Because sensor data are time-sensitive, analyzing different amounts of sensor data gives different results. The relationship between sensor nodes also differs from period to period, so selecting appropriate data for analysis is important. Sensor nodes sense data periodically. The algorithm in this paper saves the data sensed by all the nodes at the same time as a series, so consecutive periods produce consecutive time series. For example, data sensed at $t_{i}, t_{i + 1}, t_{i + 2}, \dots$ are saved as the consecutive time series $(V_{S 1 t_{i}}, V_{S 2 t_{i}}, V_{S 3 t_{i}}, \dots, V_{S m t_{i}})$, $(V_{S 1 t_{i + 1}}, V_{S 2 t_{i + 1}}, V_{S 3 t_{i + 1}}, \dots, V_{S m t_{i + 1}})$, $(V_{S 1 t_{i + 2}}, V_{S 2 t_{i + 2}}, V_{S 3 t_{i + 2}}, \dots, V_{S m t_{i + 2}})$, and so on. The most relevant time series are selected as the sample based on the correlation function. This not only ensures that there are no redundant sample data, which reduces the computational complexity, but also ensures that the sample data have the strongest correlation with the missing data, which improves the accuracy of the analysis.

(ii) Secondly, correlation analysis is conducted in the spatial dimension. The distance between sensor nodes is defined according to the requirements of the estimation. The most relevant sensor nodes are selected with the distance function by analyzing the aforementioned sample data. Those relevant nodes are used to obtain the spatial estimate. The weight $w_{i}$ of each relevant node is determined by its average correlation coefficient with the estimated node.

(iii) Thirdly, in the time dimension, estimation is based on the sample data sensed by the estimated node. To exploit the timeliness of the data, past frames are distinguished chronologically during the analysis, so that newer data contribute more. The weight of the temporal estimate is $1 - \sum_{i = 1}^{s_{n}} w_{i}$. Temporal and spatial results are integrated to obtain the final estimated value.
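As a minimal sketch of how the two partial results might be combined, the following Python function applies formula (1) literally. The function name `fuse_estimates` and the check that the node weights sum to a value in [0, 1] are assumptions made for illustration; the paper does not state such a check.

```python
def fuse_estimates(v_spatial, v_temporal, weights):
    """Combine the spatial and temporal estimates as in formula (1)."""
    total_w = sum(weights)  # combined weight given to the spatial estimate
    # Assumption: the spatial share must be a valid proportion.
    if not 0.0 <= total_w <= 1.0:
        raise ValueError("sum of node weights must lie in [0, 1]")
    # The temporal estimate receives the remaining weight.
    return total_w * v_spatial + (1.0 - total_w) * v_temporal


# Two relevant nodes with weight 0.3 each: 60% spatial, 40% temporal.
estimate = fuse_estimates(v_spatial=22.0, v_temporal=21.0, weights=[0.3, 0.3])
```

With no relevant nodes the weight list is empty, and the sketch falls back to the temporal estimate alone.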

4. Detailed Design of TSCA

4.1. Select Sample Data

The relationship between the sensor nodes in a WSN changes over time, so analyzing different sample data yields different relationships and therefore different estimates. In addition, the size of the sample has a great impact on the results. Because of environmental noise, too little sample data cannot fully reflect the spatiotemporal correlation of sensor data, while too much sample data reflects the average over an extended period rather than the instantaneous correlation, which reduces the accuracy of the estimate. Therefore, the values and the size of the sample data should be determined as accurately as possible.

Because the spatiotemporal correlation of sensor data remains approximately constant over a short period of time, data sensed close to $t_{n}$ should be selected as the sample when we estimate the missing data at $t_{n}$.

In a WSN, sensor nodes are deployed in a given area. All the sensor nodes can be listed as $(S 1, S 2, S 3, \dots, S m)$. These sensor nodes report sensed data at a fixed time interval. At time $t_{i}$, all the reported data constitute a time series $S (t_{i}) = (S 1_{t_{i}}, S 2_{t_{i}}, S 3_{t_{i}}, \dots, S m_{t_{i}})$. Data sensed at many contiguous moments form a random process $S (t)$, as shown in Figure 2. Assuming that a sensor reading is lost at $t_{n}$, we analyze the average correlation of the series at $t_{n}$ with the preceding time series to determine the optimal sample data:

\begin{matrix} R = \frac{1}{n - t_{k}} \sum_{j = t_{k}}^{n - 1} R_{ss} (S_{t_{n}}, S_{t_{j}}) \\ \text{objective: } \min k \\ \text{subject to: } R = \max (R) . \end{matrix}

(2)

Figure 2: Select sample data.

As validated by practical data, the correlation of time series is basically stable over a short period of time and then follows a decreasing trend. So we can obtain the most relevant sample data, from index $t_{k}$ to $n - 1$, based on formula (2). $t_{k}$ is determined heuristically and is initially set to $n - 1$. The correlation between $t_{n}$ and $t_{n - 1}$ is calculated first; then $t_{k}$ moves toward earlier time slots and the average correlation is recalculated until it reaches its maximum. In Figure 2, $t_{k} = i$, so the data between $t_{i}$ and $t_{n - 1}$ are the sample data. $R_{ss}$, the correlation between two time series, is computed by the following formula:

\begin{matrix} R_{ss} (S_{t_{i}}, S_{t_{i - 1}}) = Z_{t_{i}} Z_{t_{i - 1}}^{T}, \end{matrix}

(3)

where $Z_{t_{i}}$ is the standardized result of vector $S (t_{i})$:

\begin{matrix} Z_{t_{i}} = normalize (S_{t_{i}}) = \frac{(S 1_{t_{i}}, S 2_{t_{i}}, \dots, S m_{t_{i}})}{\sqrt{S 1_{t_{i}}^{2} + S 2_{t_{i}}^{2} + \dots + S m_{t_{i}}^{2}}} . \end{matrix}

(4)

The pseudocode of the selection process is given in Algorithm 1.

Algorithm 1: Procedure SelectSampleData.

Input:

$S_{m \times t}$ : matrix of sensor data

Output:

$S_{m \times (t - i + 1)}$ : a collection of sample data

Main Steps:

(1) $Z_{t} \leftarrow normalize (S_{t})$

(2) $R \leftarrow 0$

(3) for $i = t - 1$ to 1 do

(4) $Z_{i} \leftarrow normalize (S_{i})$

(5) $R_{ss} (S_{t}, S_{i}) \leftarrow Z_{t} Z_{i}^{T}$

(6) $R_{last} \leftarrow R$

(7) $R \leftarrow (R_{last} * (t - i - 1) + R_{ss} (S_{t}, S_{i})) / (t - i)$

(8) if $R < R_{last}$

(9) return i;

(10) end if

(11) end for
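The selection procedure can be sketched in Python as follows. This is an illustrative reading of Algorithm 1 and formulas (2) to (4), not the authors' implementation. The function names are hypothetical, the sketch assumes that the readings of the node being estimated have been left out of every series so that all vectors are complete, and it keeps a true running average of the correlations. The pseudocode returns the index at which the average first drops; the sketch returns the slot just after it, so that the returned window is the one with the highest running average.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length, as in formula (4)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in vec]


def rss(a, b):
    """Correlation of two time series: dot product of unit vectors, formula (3)."""
    za, zb = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(za, zb))


def select_sample_start(series):
    """Return the index of the earliest series to keep as sample data.

    series[k] holds the readings of all nodes at time slot k; the last
    entry is the slot that contains the missing value.
    """
    target = series[-1]
    avg = 0.0
    count = 0
    for i in range(len(series) - 2, -1, -1):  # walk back from slot t-1
        new_avg = (avg * count + rss(target, series[i])) / (count + 1)
        if count > 0 and new_avg < avg:
            return i + 1  # including slot i lowered the average
        avg, count = new_avg, count + 1
    return 0  # the average never dropped: keep the whole history
```

For example, with `series = [[1.0, 9.0], [2.0, 2.0], [2.0, 2.2], [2.0, 2.0]]` the oldest slot correlates poorly with the last one, so the sketch keeps slots 1 and 2 as the sample.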

4.2. Spatial Correlation

Definition 1.

Let $S i$ and $S j$ be the sample datasets (data sensed between $t_{i}$ and $t_{n - 1}$) reported by sensor nodes i and j. The data dissimilarity of the two nodes is $d_{diff} (S i_{t_{n}}, S j_{t_{n}}) = |S i - S j|$. The collections of lost data are $S i_{miss}$ and $S j_{miss}$, and the frequency of simultaneous data loss is $d_{miss} (S i_{t_{n}}, S j_{t_{n}}) = |S i_{miss} \cap S j_{miss}|$. The size of the sample data is $sample\_size = |S i| = |S j|$.

Definition 2.

The distance between sensor nodes $S i$ and $S j$ at time t is $d (S i_{t}, S j_{t})$:

\begin{matrix} d (S i_{t}, S j_{t}) = \frac{\sqrt{d_{diff} (S i_{t}, S j_{t})^{2} + d_{miss} (S i_{t}, S j_{t})^{2}}}{sample\_size}, \quad d (S i_{t}, S i_{t}) = 1 . \end{matrix}

(5)

If $S j$ also loses its data at the time t for which node $S i$ is estimated, then $d (S i_{t}, S j_{t}) = 1$. For example, in Figure 3, sensor node 3 is to be estimated at $t_{n}$. If the data of a node i ($i = 1, 2, 4, \dots, m$) are also missing at $t_{n}$, then $d (S i_{t_{n}}, S 3_{t_{n}}) = 1$.

Figure 3: Spatial correlation.

As shown in Figure 3, to estimate the missing data of sensor node $S 3$, the distances between $S 3$ and all the other nodes $S 1, S 2, S 4, \dots, S m$ are computed to obtain the array $d (S 3_{t_{n}}) = [d (S 1_{t_{n}}, S 3_{t_{n}}), d (S 2_{t_{n}}, S 3_{t_{n}}), \dots, d (S m_{t_{n}}, S 3_{t_{n}})]$. The nodes whose distance from $S 3$ is smaller than the threshold (0.2 by default in this paper) are selected according to $d (S 3_{t_{n}})$. These selected sensor nodes, which have strong spatial correlation with node $S 3$, compose the collection $S_{Correlate}$.
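The distance of formula (5) and the threshold selection can be sketched as follows. Several details are assumptions, because the definitions leave them open: a lost reading is represented by `None`, $|S i - S j|$ is interpreted as the Euclidean norm of the differences over the slots where both nodes reported, and the special case that forces the distance to 1 when the neighbor is also missing at $t_{n}$ is not covered. The names `node_distance` and `select_correlated` are hypothetical.

```python
import math


def node_distance(a, b):
    """Distance between two nodes over the sample window, after formula (5).

    a and b are equally long lists of readings; None marks a lost reading.
    """
    sample_size = len(a)
    both = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    # Assumption: |Si - Sj| is the Euclidean norm over slots where both reported.
    d_diff = math.sqrt(sum((x - y) ** 2 for x, y in both))
    # Number of slots in which both nodes lost their reading.
    d_miss = sum(1 for x, y in zip(a, b) if x is None and y is None)
    return math.sqrt(d_diff ** 2 + d_miss ** 2) / sample_size


def select_correlated(target, others, threshold=0.2):
    """Return the ids of the nodes whose distance to target is below threshold."""
    return [node_id for node_id, readings in others.items()
            if node_distance(target, readings) < threshold]


target = [20.0, 20.1, 20.2, 20.3]
others = {
    "S1": [20.1, 20.2, 20.3, 20.4],  # tracks the target closely
    "S2": [25.0, 25.1, 25.2, 25.3],  # offset by five degrees
}
correlated = select_correlated(target, others)
```

In this example only `S1` falls below the default threshold of 0.2, so it alone would enter $S_{Correlate}$.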

Each node in $S_{Correlate}$ contributes an estimate of the missing data based on its instantaneous rate of change at $t_{n}$. Different weights are assigned to the nodes according to their spatial correlation. The spatial estimate is computed by the following formula:

\begin{matrix} V_{Spatial} = \sum_{S i \in S_{Correlate}} w_{i} * V_{S j} (t_{n - 1}) * \frac{d V (S i_{t_{n}})}{d t_{n}}, \end{matrix}

(6)

where $S i$ is a sensor node in $S_{Correlate}$, $V_{S j} (t_{n - 1})$ is the value of node $S j$ at the first moment before $t_{n}$, and $d V (S i_{t_{n}}) / d t_{n}$ is the instantaneous change rate of the relevant node $S i$ at $t_{n}$, which can be approximated as the change rate between $t_{n}$ and $t_{n - 1}$; that is, $d V (S i_{t_{n}}) / d t_{n} = (V (S i_{t_{n}}) - V (S i_{t_{n - 1}})) / (t_{n} - t_{n - 1})$.

$w_{i}$ is the weight corresponding to $S i$, which is determined by the average correlation coefficient between the sensor nodes. It is calculated as follows:

\begin{matrix} w_{i} = \frac{\psi (S i, S j)}{|S_{Correlate}|} = \frac{cov (S i, S j)}{\sigma_{S i} * \sigma_{S j} * |S_{Correlate}|} = \frac{E [(S i - E (S i)) * (S j - E (S j))]}{\sigma_{S i} * \sigma_{S j} * |S_{Correlate}|} . \end{matrix}

(7)
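A short sketch of the weight calculation in formula (7) follows. It computes the Pearson correlation coefficient between each relevant node and the estimated node over the sample window and divides it by the number of relevant nodes. The function names are hypothetical, and the sketch assumes complete, equally long sample vectors.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def node_weights(target, correlated):
    """Weight of each relevant node as in formula (7)."""
    k = len(correlated)  # |S_Correlate|
    return [pearson(readings, target) / k for readings in correlated]


target = [1.0, 2.0, 3.0, 4.0]
weights = node_weights(target, [[2.0, 4.0, 6.0, 8.0], [1.5, 2.5, 3.5, 4.5]])
```

Both neighbors in the example are perfectly correlated with the target, so each receives weight 0.5 and the weights sum to 1, which leaves no weight for the temporal estimate in formula (1).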

The pseudocode of the spatial correlation analysis is given in Algorithm 2.

Algorithm 2: Procedure AnalysisInSpace.

Input:

$S_{m \times (t - i + 1)}$ : sample data

$S_{miss}$ : estimated sensor node

V: threshold of distance

Output:

$V_Spatial$ : estimation value in spatial dimension

Main Steps:

(1) $V_{Spatial} \leftarrow 0$, $S_{Correlate} \leftarrow \varnothing$

(2) for $k = 1$ to m with $S_{k} \neq S_{miss}$ do

(3) $d_{k} \leftarrow d (S_{k}, S_{miss})$

(4) if $d_{k} \le V$

(5) $S_{Correlate} \leftarrow S_{Correlate} \cup \{S_{k}\}$

(6) end if

(7) end for

(8) for each $S_{k} \in S_{Correlate}$

(9) $w_{k} \leftarrow \frac{\psi (S_{k}, S_{miss})}{|S_{Correlate}|}$

(10) $r_{k} \leftarrow \frac{d V (S k_{t})}{d t}$

(11) $V_{Spatial} \leftarrow V_{Spatial} + w_{k} * r_{k} * V_{S_{miss}} (t_{n - 1})$

(12) end for

4.3. Temporal Correlation

As shown in Figure 4, we estimate the missing data based on the historical sample data of the estimated node. The estimate is obtained from a combined measure of the variation of the sample data. The change rate of the data is defined as $r_{t_{n}}$:

\begin{matrix} r_{t_{n}} = \frac{\nabla V_{t_{n}}}{t_{n} - t_{n - 1}} = \frac{V_{t_{n}} - V_{t_{n - 1}}}{t_{n} - t_{n - 1}}, \end{matrix}

(8)

where $V_{t_{n}}$ is the sensing data of the estimated node at $t_{n}$.

Figure 4: Temporal correlation.

The change rates of all sample data are computed and different weights $w_{t}$ are assigned to them, which gives the weighted rate of change $r_{w}$:

\begin{matrix} r_{w} = \sum r_{t} * w_{t} . \end{matrix}

(9)

Based on the weighted rate of change and the value of the estimated node at $t_{n - 1}$, we obtain the temporal estimate $V_{Temple}$:

\begin{matrix} V_{Temple} = V_{t_{n - 1}} + r_{w} * (t_{n} - t_{n - 1}) . \end{matrix}

(10)

The calculation of the weighted rate of change is listed in Table 1, where $r_{t}$ is the rate of change and a weight $w_{t}$, determined by the position in the sequence, is assigned to each $r_{t}$ so that newer rates receive larger weights. The pseudocode of the temporal correlation analysis is given in Algorithm 3.

Table 1: Weighted rate of change.

Time	$t_{n - 1}$	$t_{n - 2}$	$t_{n - 3}$	$t_{n - 4}$	⋯	$t_{i}$

Data	$V t_{n - 1}$	$V t_{n - 2}$	$V t_{n - 3}$	$V t_{n - 4}$	⋯	$V t_{i}$

$r_{t}$		$\frac{\nabla {V t}_{n - 1}}{t_{n - 1} - t_{n - 2}}$	$\frac{\nabla {V t}_{n - 2}}{t_{n - 2} - t_{n - 3}}$	$\frac{\nabla {V t}_{n - 3}}{t_{n - 3} - t_{n - 4}}$	⋯	$\frac{\nabla {V t}_{i + 1}}{t_{i + 1} - t_{i}}$

$w_{t}$		$\frac{n - i - 1}{\sum_{k = 1}^{n - i - 1} k}$	$\frac{n - i - 2}{\sum_{k = 1}^{n - i - 1} k}$	$\frac{n - i - 3}{\sum_{k = 1}^{n - i - 1} k}$	⋯	$\frac{1}{\sum_{k = 1}^{n - i - 1} k}$

Algorithm 3: Procedure AnalysisInTime.

Input:

$S_{miss \times (t - i + 1)}$ : sample data of estimated sensor node

Output:

V_Temple: estimation value in temporal dimension

Main Steps:

(1) for $j = n - 1$ to $i + 1$ do

(2) $r_{t_{j}} \leftarrow \frac{\nabla V_{t_{j}}}{t_{j} - t_{j - 1}}$

(3) $w_{t_{j}} \leftarrow \frac{j - i}{\sum_{k = 1}^{n - i - 1} k}$

(4) end for

(5) $r_{w} \leftarrow \sum r_{t_{j}} * w_{t_{j}}$

(6) $V_{Temple} \leftarrow V_{t_{n - 1}} + r_{w} * (t_{n} - t_{n - 1})$
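The temporal analysis can be sketched in Python as follows, following formulas (8) to (10) and the weights of Table 1. The function name `temporal_estimate` and its argument layout are assumptions made for illustration; the sketch expects the sample of the estimated node in chronological order, ending at $t_{n - 1}$, with no missing values.

```python
def temporal_estimate(times, values, t_target):
    """Estimate the reading at t_target from one node's sample data.

    times and values are in chronological order and end at t_{n-1}.
    """
    if len(values) < 2 or len(times) != len(values):
        raise ValueError("need at least two samples with matching timestamps")
    # Rates of change between consecutive samples, oldest first (formula (8)).
    rates = [(values[j] - values[j - 1]) / (times[j] - times[j - 1])
             for j in range(1, len(values))]
    # Linearly increasing weights 1, 2, ..., K, normalized to sum to 1,
    # so the newest rate receives the largest weight (Table 1).
    denom = sum(range(1, len(rates) + 1))
    r_w = sum(rate * (k + 1) / denom for k, rate in enumerate(rates))
    # Extrapolate from the last known value (formula (10)).
    return values[-1] + r_w * (t_target - times[-1])


# Samples at t = 0, 1, 2; the value at t = 3 is missing.
v_temple = temporal_estimate([0, 1, 2], [20.0, 20.5, 21.5], 3)
```

In the example the two rates are 0.5 and 1.0 with weights 1/3 and 2/3, so the weighted rate is 2.5/3 and the estimate is 21.5 + 2.5/3.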

4.4. Discussion

Unlike traditional missing data [20], sensor data have five typical loss patterns [16]: Element Random Loss, Block Random Loss, Element Frequent Loss in Row, Successive Elements Loss in Row, and Combinational Loss, as shown in Figure 5. The algorithm in this paper assumes the Combinational Loss pattern, that is, any combination of the first four patterns. To improve the applicability of our algorithm, we adopt specific strategies that make it suitable for some severe loss situations. Because the algorithm estimates missing values from the spatial and temporal aspects, severe loss mainly shows up as rows or columns that are missing continuously.

Figure 5: Data loss patterns in WSN (the black cells represent missing data).

For severe data loss in a time series, if the missing rate of the time series exceeds a threshold (40% by default in this paper), the time series is ignored and the selection of sample data continues with earlier time series. As shown in Figure 6(a), the loss in the time series at $t_{n - 2}$ is serious, so this moment is skipped when the sample data are selected.
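The first strategy can be sketched as a simple filter over the time series. The function name `usable_time_slots` and the use of `None` to mark a lost reading are assumptions for illustration; the 40% default is the threshold stated above.

```python
def usable_time_slots(series, max_missing_rate=0.4):
    """Return the indices of time slots whose missing rate does not exceed the threshold.

    series[k] lists the readings of all nodes at slot k; None marks a lost reading.
    """
    keep = []
    for k, readings in enumerate(series):
        missing = sum(1 for r in readings if r is None)
        if missing / len(readings) <= max_missing_rate:
            keep.append(k)
    return keep


series = [
    [1.0, 2.0, 3.0, 4.0, 5.0],     # complete
    [None, None, None, 4.0, 5.0],  # 60% missing: skipped
    [1.0, None, 3.0, None, 5.0],   # 40% missing: kept
]
slots = usable_time_slots(series)
```

Only the slots returned by the filter would then be passed on to the sample selection step.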

Figure 6: Severe data loss patterns in this paper (the black cells represent missing data).

If the missing rate of the sample data of the estimated sensor node does not exceed the threshold (50% by default in this paper), the missing samples are ignored and the algorithm described above is used to estimate the missing data directly. If the missing rate of the sample data exceeds the threshold, the algorithm obtains the final result through iteration. As shown in Figure 6(b), node 6 has a serious lack of sample data. The values of node 6 at $t_{n - 6}$, $t_{n - 5}$, $t_{n - 3}$, and $t_{n - 2}$ are computed in turn until the missing rate falls below the threshold. Every iteration is based on the results of the previous estimation, so if a previous estimate is not accurate enough, the error of the next estimate increases. However, the algorithm in this paper avoids iteration in the spatial correlation analysis by calculating the distance; iteration occurs only in the temporal correlation analysis. The simulation results in Section 5 show that the iterative error of our algorithm is small.

5. Performance Evaluation

The algorithm proposed in this paper is evaluated on real-world data, namely, the Intel Lab dataset [21]. This dataset is a trace of readings from 54 sensor nodes deployed in the Intel Research Berkeley Lab. These sensor nodes collected light, humidity, temperature, and other information once every 30 s from February 28 to April 5, 2004.

Since the original dataset contains missing values, we evaluate the algorithm on a relatively complete part of the data, obtained by removing sensor nodes with serious data loss. For example, when the sampling interval is set to five minutes, nodes 5 and 15 lack most of their data (90% of the data are lost), so the data of these two sensor nodes are not selected as samples. In this paper, we use estimation accuracy as the evaluation criterion. Specifically, we use the Root Mean Square Error (RMSE):

\begin{matrix} RMSE = \sqrt{average \left( (V_{S j} (t_{i}) - V_{S j}^{'} (t_{i}))^{2} \right)}, \end{matrix}

(11)

where $V_{S j} (t_{i})$ is a known value that is treated as missing data and $V_{S j}^{'} (t_{i})$ is the estimated value of $V_{S j} (t_{i})$.
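For reference, formula (11) corresponds to the following small Python function. The name `rmse` and the list-based interface are illustrative choices, not part of the paper.

```python
import math


def rmse(actual, estimated):
    """Root mean square error between held-out values and their estimates."""
    if len(actual) != len(estimated) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - e) ** 2 for a, e in zip(actual, estimated)]
    return math.sqrt(sum(squared) / len(squared))


# Three held-out readings and their estimates: errors 0, 1, and 2.
error = rmse([20.0, 21.0, 22.0], [20.0, 22.0, 20.0])
```

The example evaluates to the square root of 5/3, since the squared errors are 0, 1, and 4.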

To verify the effectiveness of the algorithm proposed in this paper, we compare it against four other algorithms: AMR [12], TRECI [17], DESM [18], and MASTER [19].

5.1. Convergence

The loss rate of the raw data is about 5%. We verify the first step of the algorithm on the original temperature dataset with the sampling interval set to 5 min. Calculating the average correlation gives a sample size of 13. We then vary the number of sample data; the accuracy comparison is shown in Figure 7. It shows that a sample that is too small or too large increases the error. So we choose the smallest advisable size, which ensures accuracy while reducing the complexity of the algorithm. In Figure 8, we compare the sample size that each algorithm requires to converge to its optimal solution. TSCA converges fast and has the best performance.

Figure 7: RMSE versus the size of sample data.

Figure 8: Convergence.

5.2. Estimation on Temperature

The error of the different algorithms is compared on the original dataset under different sampling intervals, as shown in Figure 9. The spatiotemporal correlation of temperature is strong, so MASTER can obtain accurate relationships by mining correlation rules. However, a few sensor nodes that are not associated with the others increase its estimation error, so its error is slightly larger than that of TSCA. As the sampling interval increases, the temporal correlation of the sensor data weakens. TRECI and DESM use temporal correlation, so their estimation errors increase. The increase for DESM is slight because it also considers spatial correlation. The spatial correlation of indoor sensor nodes remains substantially constant over a short period of time, so the sampling interval has little effect on AMR, which considers only the spatial correlation. TSCA takes both the temporal and spatial correlation into account and assigns weights according to the time order of the data, so that newer data play a more important role in the estimation; as a result, the sampling interval has less effect on its results.

Figure 9: RMSE versus sampling interval on temperature.

According to [16], 23% of the data are lost among the 84,600 time slots (one month) of the Intel Indoor dataset. Therefore, we compare the error of the different algorithms with the data missing rate set to 5%–35% and the sampling interval set to 5 min. Figure 10 shows that the error of all the algorithms increases with the missing rate, because the spatiotemporal correlation of sensor data becomes weaker as the missing rate increases. However, TSCA applies the strategies described in Section 4 according to the pattern of data loss, which reduces its error.

Figure 10: RMSE versus data loss on temperature.

5.3. Estimation on Humidity

The error of humidity estimation is compared among the different algorithms on the original dataset with sampling intervals of 1–30 min, as shown in Figure 11. Compared with temperature, the spatiotemporal correlations of humidity are weaker, and the spatial correlation is much weaker than the temporal one. AMR is based only on the spatial correlation, so its error is the largest. As with temperature, the temporal correlation of the sensor data weakens as the sampling interval increases. TRECI is mainly based on temporal correlation, so its error increases markedly. When the sampling interval reaches 30 min, the error of TRECI exceeds that of AMR. The results of the other three algorithms are similar, but the error of TSCA is still the smallest.

Figure 11: RMSE versus sampling interval on humidity.

Figure 12 shows the error under different data loss rates. When the loss rate exceeds 20%, the spatial and temporal correlations of humidity are severely affected and the errors of DESM, TRECI, and AMR surge. The loss rate has a greater impact on the temporal correlation, so the error of TRECI increases more markedly. TSCA relies mainly on the latest data, and the missing data in its sample have already been processed, so its performance remains relatively stable.

Figure 12: RMSE versus data loss on humidity.

6. Conclusion

To address the deficiencies of existing algorithms for missing data estimation, this paper proposes TSCA, which is based on the spatiotemporal correlation of sensor data. The algorithm selects the most relevant data as the analysis sample, which ensures that there are no redundant sample data and that the sample has the strongest correlation with the missing data. This improves both the efficiency and the accuracy of the algorithm. Furthermore, the temporal and spatial dimensions are analyzed together to estimate the missing data. Experimental results show that, in all the cases tested, TSCA achieves the lowest estimation error among the compared algorithms.

In the future, we can exploit the correlations between different attributes to further improve the accuracy of estimation; for example, light has an impact on temperature in many scenarios.

Footnotes

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This paper is partly supported by the National Natural Science Foundation of China (61272515, 61121061), Beijing Higher Education Young Elite Teacher Project (YETP0474), and National Science & Technology Pillar Program (2015BAH03F02).

References

1. Yick, Mukherjee, Ghosal. Wireless sensor network survey. Computer Networks, 2008, 52(12), 2292–2330. doi:10.1016/j.comnet.2008.04.002

2. Alippi, Boracchi, Roveri. On-line reconstruction of missing data in sensor/actuator networks by exploiting temporal and spatial redundancy. Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), 2012, 1–8.

3. Moayedi, Foo Y. K., Soh Y. C. Adaptive Kalman filtering in networked systems with random sensor delays, multiple packet dropouts and missing measurements. IEEE Transactions on Signal Processing, 2010, 58(3), 1577–1588. doi:10.1109/tsp.2009.2037853

4. Fletcher A. K., Rangan, Goyal V. K. Estimation from lossy sensor data: jump linear modeling and Kalman filtering. Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks (IPSN '04), April 2004, ACM, 251–258.

5. Liu, Zhao, Tang S.-J., X.-Y., Dai. Canopy closure estimates with GreenOrbs: sustainable sensing in the forest. Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems (SenSys '09), November 2009, ACM, 99–112. doi:10.1145/1644038.1644049

6. Madden, Franklin M. J., Hellerstein J. M., Hong. The design of an acquisitional query processor for sensor networks. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '03), June 2003, San Diego, Calif, USA, ACM, 491–502. doi:10.1145/872757.872817

7. Zhu, L. M. SEER: metropolitan-scale traffic perception based on lossy sensory data. Proceedings of the 28th Conference on Computer Communications (IEEE INFOCOM '09), April 2009, 217–225. doi:10.1109/infcom.2009.5061924

8. Bendat J. S., Piersol A. G. Random Data: Analysis and Measurement Procedures. 2011, New York, NY, USA, John Wiley & Sons. doi:10.1002/9781118032428

9. Gruenwald, Chok, Aboukhamis. Using data mining to estimate missing sensor data. Proceedings of the 17th IEEE International Conference on Data Mining Workshops (ICDM '07), October 2007, 207–212. doi:10.1109/icdmw.2007.103

10. Madden S. R., Franklin M. J., Hellerstein J. M., Hong. TinyDB: an acquisitional query processing system for sensor networks. ACM Transactions on Database Systems, 2005, 30(1), 122–173. doi:10.1145/1061318.1061322

11. Gruenwald, Yang, Sadik M. S. Using data mining to handle missing data in multi-hop sensor network applications. Proceedings of the 9th ACM International Workshop on Data Engineering for Wireless and Mobile Access, 2010, ACM, 9–16.

12. Pan, Gao, Liu. A spatial correlation based adaptive missing data estimation algorithm in wireless sensor networks. International Journal of Wireless Information Networks, 2014, 21(4), 280–289. doi:10.1007/s10776-014-0253-9

13. Silberstein, Braynard, Yang. Constraint chaining: on energy-efficient continuous monitoring in sensor networks. Proceedings of the ACM SIGMOD International Conference on Management of Data, June 2006, ACM, 157–168. doi:10.1145/1142473.1142492

14. Liu, You, Shan, Liu. A grey system based missing sensor data estimation algorithm. Proceedings of the 2nd International Conference on Computer Science and Network Technology (ICCSNT '12), December 2012, IEEE, 482–486. doi:10.1109/iccsnt.2012.6525982

15. Niu, Zhao, Qiao. A missing data imputation algorithm in wireless sensor network based on minimized similarity distortion. Proceedings of the 6th International Symposium on Computational Intelligence and Design (ISCID '13), October 2013, 235–238. doi:10.1109/iscid.2013.172

16. Kong, Xia, Liu X.-Y., M.-Y., Liu. Data loss and reconstruction in sensor networks. Proceedings of the 32nd IEEE Conference on Computer Communications (INFOCOM '13), April 2013, Turin, Italy, 1654–1662. doi:10.1109/infcom.2013.6566962

17. Appice, Ciampi, Malerba, Guccione. Using trend clusters for spatiotemporal interpolation of missing data in a sensor network. Journal of Spatial Information Science, 2013, 6, 119–153. doi:10.5311/josis.2013.6.102

18. Deshmukh W. P. Data estimation in sensor networks using physical and statistical methodologies. Proceedings of the 28th International Conference on Distributed Computing Systems (ICDCS '08), July 2008, IEEE, 538–545. doi:10.1109/icdcs.2008.22

19. Chok, Gruenwald. Spatio-temporal association rule mining framework for real-time sensor network applications. Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM '09), November 2009, ACM, 1761–1764. doi:10.1145/1645953.1646224

20. Zhu. Compressive sensing approach to urban traffic sensing. Proceedings of the 31st International Conference on Distributed Computing Systems (ICDCS '11), July 2011, IEEE, 889–898. doi:10.1109/icdcs.2011.35

21. Madden. Intel Berkeley research lab data. http://www.select.cs.cmu.edu/data/labapp3/index.html