Abstract
Compressed sensing (CS) is an emerging sampling technique by which the data sampling and aggregating can be done simultaneously, which can be applied to many fields, including data processing in wireless sensor networks (WSNs). In WSNs, data aggregating can reduce data transmission cost and improve energy efficiency. Existing CS-based data gathering work in WSNs utilizes the centralized method to process the data by a sink node, which causes the load imbalance and “coverage hole” problems, and so forth. In this paper, we propose an energy-balanced data gathering and aggregating (EDGA) scheme that integrates a clustering hierarchical structure with the CS to optimize and balance the amount of data transmitted. We also design a data reconstruction algorithm to perform data recovery tasks by utilizing the orthogonal matching pursuit theory, which helps to reconstruct the original data accurately and effectively at sink node. The advantages of the proposed scheme compared with other state-of-the-art related methods are measured on the metrics of data recovery ratio and energy efficiency. We implement our scheme on a simulation platform using a real dataset from Intel lab. Simulation results demonstrate that the proposed data gathering and aggregating scheme guarantees accurate data reconstruction performance and obtains energy efficiency significantly compared to existing methods.
1. Introduction
Wireless sensor networks (WSNs) have received huge research focus for their wide applications, including battlefield surveillance, environmental monitoring, medical diagnosis, and wildlife exploration [1]. Due to the limitations of size and cost, sensors are usually equipped with limited and nonrechargeable power source. Therefore, how to save energy for prolonging network lifetime turns into an important issue in WSNs.
Data aggregation is considered as an effective method for WSNs to improve energy efficiency [2, 3]. It is the process of gathering and reconstructing the data from multiple sensors to eliminate redundant data and provide fused data to sink node [4]. Many data fusion approaches are proposed to solve this problem for WSNs. The cluster-based data aggregation is one of the most important kinds for its advantages on energy and load balance [5–7]. For cluster-based approaches, sensor nodes are grouped into some clusters and each cluster consists of a cluster-head (CH) and members. Every CH node gathers and aggregates the sensed data of its members and, then, sends the fused data to sink node directly or by the multihop route of CH node. Figure 1(a) shows a WSN where sensor nodes are densely deployed in the monitoring area to detect the environmental phenomena based on clustering architecture.

An example of data gather and aggregation in WSNs, (a) data gather and transmission in cluster-based network architecture, and (b) data aggregating from CH1 to sink.
Compressed sensing (CS) [8–11] is a collection of recently proposed sampling methods in Information Theory. The promise of compressed sensing is that it can obtain a sufficiently accurate approximation of an unknown data field by using a small number of generalized measurements, which are known as projections in the compressed sensing literature. The objective of data compressing is twofold: compress sensor readings to reduce global data traffic and distribute energy consumption evenly to prolong network lifetime. Figure 1(b) shows a round CS measurement gathering process from CH1 node to sink under multiple-hop topology. Each sensor sends data packets to its next-hop sensor.
When compressed sensing is applied to in-network data compression, it will bring a wealth of similar benefits as distributed source coding including simple encoding process, saving of internode data exchange, and decoupling of compression from routing. Furthermore, compressed sensing has two additional advantages. First, it can deal with abnormal sensor readings gracefully. Second, data reconstruction is not sensitive to packet losses. In compressed sensing, all messages received by sink node are equally important. However, in distributed source coding, received data are predefined as main or side information. Losing main information causes fatal errors to the decoder. All these desired merits make compressed sensing a promising solution to the data gathering and aggregation problem in large-scale wireless sensor networks.
In order to improve energy efficiency of wireless sensor network, we proposed an energy-balanced data gathering and aggregating scheme based on CS in [12]. In this paper, we improve this scheme using a cluster-based method to minimize and balance data traffic. The two important issues for our scheme are to divide the clusters with a lower traffic cost and reconstruct the original data with a lower data loss provability. So, we first propose a clustering method to minimize the data traffic in intracluster. Then, we use CS method to design a distributed data gathering and aggregating scheme in intercluster. Our scheme includes three steps.
Member nodes send data directly to the CH node using time division multiple access, as the data traffic is very low in intracluster. Once receiving data collected by their cluster, CH node runs the necessary data aggregation tasks using CS. CH node routes the processed data toward sink node over the relay nodes (CH).
In summary, this paper makes a number of major contribution as follows:
We study the network clustering problem for improving and balancing the energy efficiency of WSNs, which provides a new insight into the way we can get a distributed data gathering method. We propose an energy-balanced data gathering and aggregating (called EDGA) scheme based on cluster and compressive sensing theories. The data is gathered and aggregated in intracluster and intercluster, respectively, which reduce the traffic cost of network. We design a data reconstruction algorithm to recover the missing data in the incomplete dataset accurately. We evaluate our scheme via simulations and real dataset collected by Intel indoor test data. The results demonstrate that our scheme is significantly effective in terms of energy efficiency and data recovery ratio compared with existing methods.
The remainder of this paper is organized as follows. Section 2 surveys related work. Section 3 explains preliminaries and models and shows how to solve this problem exactly under considering the constraints. In Section 4, we design an energy-balanced data gathering and aggregating scheme based on compressed sensing theory. The performance of the proposed scheme is evaluated and compared with existing methods in Section 5. We conclude this paper in Section 6.
2. Related Work
With the emergence of compressed sampling theory [10, 13, 14], we have seen a new avenue of research in the field of in-network data compression. Compressed wireless sensing (CWS) [3] appears to be able to reduce the latency of data gathering in a single-hop network by delivering linear projections of sensor readings through synchronized amplitude modulated analog transmissions. Due to the difficulties in analog synchronization, CWS is less practical for large-scale sensor networks. Multichannel singular spectrum analysis (MSSA) [15] is a nonparametric and adaptive method based on the lag-covariance matrix. It is used to produce geographic data reconstruction.
Ji et al. [16] proposed a Bayesian compressed sensing (BCS) algorithm as dealing with images is high-dimensional problem. Chou et al. proposed an energy efficient information collection (EEIC) algorithm [14] to collect information from sensor networks taking into account both energy consumption and the amount of information in the sensing data. In an overview paper, Haupt et al. [17] also speculated the potential of using compressed sampling theory for data aggregation in a multihop WSN. However, no real scheme has been reported based on this initial idea. Xu et al. [18] proposed a compressed sparse functions approach to collect data through the use of CS techniques.
For improving the energy efficiency, some cluster-based methods were proposed to balancer energy dissipation. Xie and Jia [7] studied the relationship between the size of clusters and number of transmissions and proposed a centralized clustering algorithm to aggregate the sensed readings. Liu et al. [19] presented a new data aggregation method to reduce communication cost based on CS and cluster through the collection of a small number of samples at a data gathering point. The CH node processes the data based on a nonpersistent CSMA (Carrier-Sense Multiple Access) MAC protocol. This improved clustering method is based on the LEACH (Low-Energy Adaptive Clustering Hierarchy) [20], a clustering-based protocol that utilizes randomized rotation of local cluster base stations (CH) to evenly distribute the energy load among the sensors in the network.
Existing CS-based data reconstruction methods are not fully suitable for service-oriented wireless sensor networks due to two reasons: (1) most computations take place at sink node rather than sensors, which causes the load imbalance and “coverage hold” problems; (2) CS-based methods usually require the data to have inherent structures. Features that are extracted from internet traffic [21] are not applicable. To solve the above challenges, an effective data recovery method is required to improve the system performance in terms of data reconstruction accuracy, load balance, and network lifetime.
3. Preliminaries and Models
In this section, we first present the objectives of energy-balanced data gathering and aggregating scheme. Then, we state the related terminologies and problems definition.
3.1. Objectives
We assume the network has the following properties:
The network is modeled as a communication graph The sensing range of all the sensor nodes are modeled as the directional sensing model shown as in Figure 2. That is, the sensing range is a sector where the location of sensor is the center, and the maximal radius is The network is divided into some cluster grids ( The sensors are deployed in random and uniform way, and the whole network is time synchronized.

Directional sensing model.
In physical environmental data reconstruction system, sensors are deployed in the monitoring area. They sense and send data to sink node via CH nodes periodically over a given time slot. For simplicity, we suppose total
The timeline of system consists of some time round, shown as in Figure 3. The nodes gather and aggregate the targets or environment information data to sink node in working phase. Each working phase is divided into t time slots. After sensors are deployed, each sensor is required to report its sensory data once per time through multihop wireless communications.

Timeline of network.
Some mathematical notations that are used throughout this paper are given in Notations.
3.2. Definitions
To describe our problem clearly, we give some definitions in this subsection.
Definition 1 (sensing model).
Sensor's sensing model is defined as a quadruple
Definition 2 (coverage area (SA)).
Sensor's coverage area is defined by
Definition 3 (measurement matrix (MM)).
Measurement matrix is a mathematical method to describe the dynamic environment data sensor sensed. MM is defined by
So, MM is a matrix consisting of n rows and t columns. Each data element in the matrix is collected validly.
Definition 4 (reconstructed matrix (RM)).
Reconstructed matrix is generated by using the proposed data reconstruction algorithm (DR-OMP) to approximate MM. RM is defined by
3.3. Problem Definition
In WSNs, in order to improve communication efficiency, the sensed data is sent to sink node after compression using the traditional methods, which causes a larger data error between the reconstructed data and the original data. Based on the above definitions, our goal is to find an optimal
3.4. Energy Consumption Model
For the estimation of the transmission energy cost, we use a standard transmission model proposed by Heinzelman et al. [20]. Such a model assumes that the energy per bit for transmission over a wireless link is a function of the distance between a transmitter and a receiver. Let
4. Energy-Balanced Data Gathering and Aggregating Scheme
We design an energy-balanced data gathering and aggregating scheme in this section, EDGA for short. We first introduce network clustering for data gathering and process based on CS theory. Then, a data reconstruction algorithm based on Orthogonal Matching Pursuit is presented in detail, which is used to recover the missing data in the incomplete dataset accurately.
4.1. Calculating Coverage Area of Nodes for Clustering
Under the requirements of monitoring tasks, the whole monitoring area is divided into some virtual monitoring grids ( Key targets or points: each key monitoring target or point should be covered by at least one cluster. Node distribution: the scale of node should meet the requirements of network on coverage and connectivity [23]. Each cluster consists of some sensor nodes. We compare the network performance under different node distributions; that is, network is sparse, mediated, and dense environment. Network coverage: as the number of clusters decreases proportionately with the effective sensing radius of sensors, the cluster's size should be set lower when the effective sensing radius is larger. Network communication: any two nodes in the same cluster and CH nodes in adjacent clusters can be communicated reliably.
For any one sensor node

An illustration of node's coverage area.
4.2. Cluster-Head Election
In cluster-based sensor network, CH node consumes more energy compared with other member nodes. If a node takes the role of CH in multiple time round, its energy will be depleted over time, which easily causes the monitoring area that cannot be covered by nodes, that is, “coverage hole.” Therefore, it is necessary that CH be elected in each time round. The node with more residual energy should be considered as CH candidate. The CH election mechanism is based on a distributed algorithm. For any one node, it elects to be CH by sending election messages following a given probability p. To balance energy consumption of network, the probability of a node
In order to balance and save energy consumption of system, time round and state transition mechanism are used [25]. The lifeline of network is divided into some time rounds, shown as in Figure 3. Each round consists of initial phase and working phase. In initial phase, node
4.3. Data Gather and Aggregation
In the process of data gathering, the readings are transmitted from CH nodes to sink node as illustrated in Figure 1(b). Each reading proceeds with a weighted sum via each node along its communication route. First, CH1 sends
Finally, sink node can get M weighted sum
Based on CS theory, using the M weighted sum can reconstruct the N node's original data. The total number of transmission is
4.4. Spatial Correlated Data Recovery
According to compressed sensing theory, a sparse signal can be reconstructed from a small number of measurements with a probability close to one. An N-dimensional signal is considered as a M-sparse signal if there exists a domain in which this signal can be denoted by M (
With sufficient number of measurements, sink node can reconstruct sensor readings through solving an
4.5. Data Reconstruction Algorithm Design
We propose a data reconstruction algorithm based on Orthogonal Matching Pursuit to recovery original data, DR-OMP for short. The pseudo-code of DR-OMP is shown as Algorithm 1.
an N-dimensional residual (1) (2) (3) (4) (5) Determine the orthogonal projector (6) (7) (8) (9) (10)
Our DR-OMP sparse approximation algorithm is used for data reconstruction. To identify the ideal data s, we need to determine which column of ϕ participates in the measurement vector v. The idea behind the algorithm is to pick columns in a greedy fashion. At each iteration, we choose the column of ϕ that is most strongly correlated with the remaining part of v. Then, we subtract its contribution to v and iterate
We analyze the complexity of Algorithm 1. The key operation is the procedure for computing the inverse matrix, which provide the best approximate result. This procedure is completed by a matrix multiplication. Therefore, its time complexity is
5. Performance Evaluations
In this section, simulations and experiments based on real dataset are conducted to evaluate our design under different network settings and reveal insights of system performance.
5.1. Simulation Settings and Dataset
We first conduct simulations using OMNET++ 4.2 [31] and MATLAB to evaluate our framework and algorithm. The sensors are deployed on 200 m × 200 m randomly and uniformly (
Simulation parameters.
The second experiment is based on a real WSN system from the Intel Berkeley Research lab [32]. The real data are gathered by Intel indoor experiment built on TinyOS platform. Mica2Dot sensors with weather boards collected timestamped topology information, along with temperature, humidity, and light once every 31 seconds.
To verify the performance of EDGA scheme, we compare it with other approaches, which include compressed sensing (CS) [33], multichannel singular spectrum analysis (MSSA) [15], energy efficient information collection (EEIC) [14].
The performance of proposed EDGA scheme is evaluated with respect to different system parameters, for example, cluster number (k), network lifetime, and data loss probability. Through these experiments, comparison between different benchmarks and our proposed EDGA scheme is used to demonstrate the performance of our design.
5.2. Experimental Results and Analysis
In the first set of simulations, we implement our method considering two methods MSSA and EEIC in terms of network lifetime, the number of cluster (k), and residual energy. Table 2 lists the cluster numbers and its corresponding sizes. Three node distributions represent the network that is sparse, mediated, and dense when the cluster number
Three types of cluster size.
Figures 5(a)–5(c) demonstrate the network lifetime under different node distributions. In the sparse network settings, our proposed EDGA scheme increases by

Comparison of the three algorithms in terms of network lifetime.

Comparison on the residual energy of nodes (
Figure 7 shows the affection of the cluster size to the network lifetime. When cluster number

Comparison of network lifetimes under different cluster sizes (
In the second set of simulations, we implement our algorithm considering three methods CS, MSSA, and EEIC in terms of error ratio on data reconstruction. Error rate (η) is defined as a metric for measuring the reconstruction error after interpolation shown as
Figures 8(a)–8(c) plot the error ratio on indoor temperature, humidity, and light. The data loss rate ranged from

Comparison of the four algorithms under different data loss probabilities.
Figures 8(b) and 8(c) show the error ratio on humidity and light. The performance of EDGA scheme on humility and light are similar with that on temperature. EDGA scheme can achieve the best performance on environment humility and light data reconstruction among the four algorithms. However, the advantage of EDGA on humidity and light is less significant than that on temperature data reconstruction for the reason that the fluctuation of humility and light are not large in a short interval.
Therefore, EDGA scheme achieves a better performance compared with the other three algorithms in terms of network lifetime and error ratio of data recovery.
6. Conclusion
In this paper, our intention was to demonstrate a new way of data gathering and aggregation in wireless sensor networks. First, we studied the network clustering issue for optimizing and balancing the energy efficiency of network, which provides a new insight into the way we can get a distributed data gathering method. Second, we proposed an energy-balanced data gathering and aggregating scheme to improve data reconstruction accuracy based on compressed sensing theory, where observed correlations between nodes have been used to reconstruct the original data, and reduce the traffic cost of network accordingly. Third, we evaluated the proposed scheme via both simulations and a real database to validate its effectiveness and efficiency. The results demonstrated that our scheme achieved better reconstruction accuracy with less than 20% error in face of 90% data missing probability.
This work leaves at least two open issues in the future. One issue is to exploit the space and time correlations between data factors for improving the accuracy of data reconstruction. Another issue is how to optimize the accuracy and complexity on data reconstruction. This may help to meet both QoS and energy efficiency requirements in wireless sensor networks.
Footnotes
Notations
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by China Postdoctoral Science Foundation under Grant no. 2014M562153, Guangzhou Education Bureau Science Foundation under Grant no. 1201430560, the Key Program of NSFC-Guangdong Union Foundation under Grant no. U1135002, and the Key Program of Guangdong Technology Innovation under Grant no. CXZD1144.
