
Abstract

The possibilistic c-means clustering algorithm (PCM) has emerged as an important technique for pattern recognition and data analysis. Owing to the existence of many missing values, it is difficult for PCM to produce a good clustering result in real time. The paper proposes a distributed weighted possibilistic c-means clustering algorithm (DWPCM), which works in three steps. First, the paper applies the partial distance strategy to PCM (PDPCM) for calculating the distance between any two objects in the incomplete data set. Further, a weighted PDPCM algorithm (WPCM) is designed to reduce the corruption caused by missing values by assigning low weight values to incomplete data objects. Finally, to improve the cluster speed of WPCM, cloud computing technology is used to optimize the WPCM algorithm by designing the distributed weighted possibilistic c-means clustering algorithm (DWPCM) based on MapReduce. The experimental results demonstrate that the proposed algorithms can produce an appropriate partition efficiently for incomplete big sensor data.

1. Introduction

Recent years have witnessed the deployment of sensor networks for many critical applications such as the Internet of Things (IoT) [1, 2], environment monitoring [3], and target detection [4, 5]. With the rapid advent of various sensor networks, more and more mobile devices, such as smart phones and RFID, can sense and collect a huge amount of sensor data. For example, data sampled from millions of sensors deployed in sensor networks for environment monitoring can exceed hundreds of terabytes every day. We are moving toward the era of big sensor data, which requires efficient, advanced data analysis and mining tools [6].

The possibilistic c-means clustering algorithm (PCM), proposed by Krishnapuram and Keller [7], has emerged as an important data mining technique for pattern recognition and data analysis. Many PCM variants have since been proposed to improve the performance of the original PCM algorithm. Zhang and Leung applied the fuzzy method to PCM to obtain an improved PCM algorithm, which enhances the robustness of the possibilistic approach [8]. In 2011, Yang and Lai proposed a robust merging approach and created the automatic merging possibilistic clustering method to obtain the number of classes automatically. However, this algorithm could not always obtain the most reasonable number of clusters [9]. Pal et al. proposed FPCM and PFCM, which combine PCM with FCM to avoid coincident clusters [10, 11]. In 2008, Xie et al. proposed an enhanced possibilistic c-means clustering algorithm, which partitions the dataset into the main cluster and the assisted cluster to avoid producing coincident clusters [12].

PCM can be extensively used in big sensor data analysis and mining. However, in big sensor data, many data sets suffer from incompleteness; that is, a data set X can contain vectors that miss one or more of the attribute values [13]. PCM cannot succeed completely in clustering such incomplete data sets in real time. On the one hand, PCM cannot calculate the distance between two objects in incomplete data sets, and its accuracy is easily corrupted by incomplete objects. On the other hand, it is difficult for PCM to satisfy the real-time requirement of clustering incomplete big sensor data due to the huge amount of data.

The paper proposes a distributed weighted possibilistic c-means algorithm (DWPCM) for clustering incomplete big sensor data. First, the paper applies the partial distance strategy (PDS) [14] to PCM (PDPCM) for calculating the distance between two objects in incomplete data set using all available attribute values. Further, a weighted PDPCM algorithm (WPCM) is designed to reduce the corruption of missing values. In WPCM, a novel method for determining weight values is presented. Finally, the cloud computing technology is used to optimize the WPCM algorithm by designing the distributed weighted possibilistic c-means algorithm (DWPCM) based on MapReduce, which aims at improving the cluster speed of WPCM.

To evaluate the performance of the proposed algorithms, we implement the proposed methods on two real big sensor data sets. The experimental results demonstrate that the proposed algorithms can produce an appropriate partition efficiently for incomplete big sensor data.

The rest of this paper is structured as follows. Section 2 reviews related research on possibilistic c-means clustering algorithms and on cluster algorithms based on cloud computing. Section 3 describes the WPCM algorithm and Section 4 describes the DWPCM algorithm based on MapReduce. The performance evaluation is presented in Section 5. The paper is concluded in Section 6.

2. Related Work

2.1. Possibilistic c-Means Algorithms

PCM partitions an m-dimensional dataset $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ into several clusters to describe an underlying structure within the data. A possibilistic partition is defined as a $c \times n$ matrix $U = [u_{i j}]$ , where $u_{i j}$ is the membership value of object $x_{j}$ towards the ith cluster, c is the number of clusters, and n is the number of objects. PCM imposes the following constraints:

\begin{matrix} u_{i j} \in [0,1] \quad \forall i, j, \qquad 0 < \sum_{j = 1}^{n} u_{i j} \leq n \quad \forall i, \\ \max_{i} u_{i j} > 0 \quad \forall j . \end{matrix}

(1)

PCM minimizes the following objective function [7]:

\begin{array}{l} J_{m} (U, V) = \sum_{i = 1}^{c} \sum_{j = 1}^{n} u_{i j}^{m} {∥ x_{j} - v_{i} ∥}^{2} \\ + \sum_{i = 1}^{c} η_{i} \sum_{j = 1}^{n} {(1 - u_{i j})}^{m} . \end{array}

(2)

In (2), $V = (v_{1}, v_{2}, \dots, v_{c})$ is a c-tuple of prototypes and $U = [u_{i j}]$ is a $c \times n$ matrix, called the possibilistic c-partition matrix, satisfying the conditions in (1). Here, $m > 1$ is a fuzzification constant and $η_{i}$ is a suitable positive number. The first term demands that the distances from the objects to the prototypes be as low as possible, whereas the second term forces the $u_{i j}$ to be as large as possible, thereby avoiding the trivial solution.

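With $d_{i j} = ∥ x_{j} - v_{i} ∥$ denoting the distance between object $x_{j}$ and prototype $v_{i}$ , solving the minimization problem (2) yields membership functions of the form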

\begin{matrix} u_{i j} = \frac{1}{1 + {(d_{i j}^{2} / η_{i})}^{1 / (m - 1)}} . \end{matrix}

(3)
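As a short check of where the membership form comes from, the following derivation is a sketch that assumes $d_{i j} = ∥ x_{j} - v_{i} ∥$ and treats each $u_{i j}$ as an independent variable in (2), as in the original PCM formulation [7]. Under that assumption the squared distance appears inside the bracket.

```latex
\begin{aligned}
\frac{\partial J_{m}}{\partial u_{i j}}
  &= m \, u_{i j}^{m - 1} d_{i j}^{2} - m \, η_{i} (1 - u_{i j})^{m - 1} = 0 \\
\Rightarrow \quad \left( \frac{u_{i j}}{1 - u_{i j}} \right)^{m - 1}
  &= \frac{η_{i}}{d_{i j}^{2}} \\
\Rightarrow \quad u_{i j}
  &= \frac{1}{1 + (d_{i j}^{2} / η_{i})^{1 / (m - 1)}}
\end{aligned}
```

The membership equals 0.5 exactly when $d_{i j}^{2} = η_{i}$ , so $η_{i}$ acts as the squared distance at which an object is half-typical of cluster i.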

The cluster centers are updated using

\begin{matrix} v_{i} = \frac{\sum_{j = 1}^{n} u_{i j}^{m} x_{j}}{\sum_{j = 1}^{n} u_{i j}^{m}} . \end{matrix}

(4)

The procedure of PCM can be described as follows.

Step 1.

Choose m, c, and $ε > 0$ and then initialize the membership matrix $U^{(0)}$ .

Step 2.

Update cluster centers using (4).

Step 3.

Estimate $η_{i}$ using the following formula:

\begin{matrix} η_{i} = \frac{\sum_{j = 1}^{n} u_{i j}^{m} d_{i j}^{2}}{\sum_{j = 1}^{n} u_{i j}^{m}} . \end{matrix}

(5)

Step 4.

Update membership matrix U using (3).

Step 5.

If ${∥ u_{i j} - u_{i j}^{'} ∥}^{2} \leq ε$ , where $u_{i j}^{'}$ denotes the membership value from the previous iteration, stop; otherwise return to Step 2.
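The five steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `max_iter` cap, and the small floor on $η_{i}$ are assumptions added for the example, the membership update uses the squared distance $d_{i j}^{2}$ as in [7], and the loop stops once the squared change in U falls to ε or below.

```python
def sq_dist(x, v):
    """Squared Euclidean distance between an object and a prototype."""
    return sum((a - b) ** 2 for a, b in zip(x, v))

def update_centers(X, U, m):
    """Step 2, equation (4): membership-weighted mean of the objects."""
    dim = len(X[0])
    centers = []
    for u_i in U:
        g = [u ** m for u in u_i]
        total = sum(g)
        centers.append([sum(gj * x[d] for gj, x in zip(g, X)) / total
                        for d in range(dim)])
    return centers

def estimate_eta(X, U, V, m):
    """Step 3, equation (5): weighted mean squared distance per cluster."""
    etas = []
    for u_i, v_i in zip(U, V):
        g = [u ** m for u in u_i]
        num = sum(gj * sq_dist(x, v_i) for gj, x in zip(g, X))
        # Floor (an assumption of this sketch) avoids division by zero
        # when a cluster collapses onto a single object.
        etas.append(max(num / sum(g), 1e-12))
    return etas

def update_memberships(X, V, etas, m):
    """Step 4, equation (3), written with the squared distance."""
    return [[1.0 / (1.0 + (sq_dist(x, v_i) / eta_i) ** (1.0 / (m - 1.0)))
             for x in X]
            for v_i, eta_i in zip(V, etas)]

def pcm(X, U0, m=2.0, eps=1e-3, max_iter=100):
    """Run PCM from an initial c x n membership matrix U0.

    Returns the final memberships and the centers computed in the
    last executed Step 2.
    """
    U = U0
    V = update_centers(X, U, m)
    for _ in range(max_iter):
        V = update_centers(X, U, m)
        etas = estimate_eta(X, U, V, m)
        U_new = update_memberships(X, V, etas, m)
        change = sum((a - b) ** 2
                     for r_new, r_old in zip(U_new, U)
                     for a, b in zip(r_new, r_old))
        U = U_new
        if change <= eps:  # Step 5: memberships have stopped moving
            break
    return U, V
```

Because $η_{i}$ is re-estimated in every pass, as Step 3 prescribes, the result depends noticeably on the initial membership matrix.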

PCM is rational and it has a meaningful interpretation. What is more important is that PCM is robust, because a noisy object would belong to the clusters with small memberships, and consequently it cannot corrupt the resulting clusters significantly [8].

In the PCM algorithm, each object is considered equally important in the clustering solution. The weighted PCM (wPCM) model introduces weights to define the relative importance of each object in the clustering solution [15, 16]. Let $w \in R^{n}$ , with $w_{j} \geq 0$ , be a vector of weight values that define the influence of each object; this results in the wPCM objective function of the form

\begin{array}{l} J_{m} (U, V) = \sum_{i = 1}^{c} \sum_{j = 1}^{n} u_{i j}^{m} {∥ x_{j} - v_{i} ∥}^{2} \\ + \sum_{i = 1}^{c} η_{i} \sum_{j = 1}^{n} {(w_{j} - u_{i j})}^{m} . \end{array}

(6)

In order to find clusters with nonspherical shapes, kernel-based PCM algorithms have been proposed, which map the original data into a higher dimensional feature space [17]. Given a kernel function k, the kernel PCM algorithm (kPCM) minimizes the following objective function:

\begin{matrix} J_{m} (U, V; k) = \sum_{i = 1}^{c} \sum_{j = 1}^{n} u_{i j}^{m} Φ_{i j}^{2} + \sum_{i = 1}^{c} η_{i} \sum_{j = 1}^{n} {(1 - u_{i j})}^{m}, \end{matrix}

(7)

where $Φ_{i j}$ is the kernel-based distance between the ith cluster prototype and the jth object.

Even though these methods perform well in the cluster process, they cannot succeed completely in clustering incomplete big sensor data because of the corruption caused by missing values. The paper proposes a weighted PCM algorithm (WPCM), which applies the partial distance to PCM for calculating the distance between two objects in an incomplete data set and then assigns low weight values to incomplete data objects to reduce the corruption caused by missing values.

2.2. Cluster Algorithms Based on Cloud Computing

Cloud computing has emerged as a significant technology to deal with big data in time by leveraging vast amounts of computing resources available on demand with low resource usage cost [18, 19]. As the key technology of cloud computing, MapReduce [20] is a highly efficient distributed programming model for large-scale data sets in parallel computing.

Many cluster algorithms based on MapReduce have been proposed to improve the cluster efficiency. For example, Zhao [21] proposed a parallel k-means algorithm based on MapReduce to cluster massive data quickly. Similar algorithms include the MapReduce-kCenter algorithm and the MapReduce-kMedian algorithm, which use iterative sampling to achieve good performance when clustering very large data [22, 23]. Another example is the hierarchical clustering algorithm based on cloud computing, in which MapReduce is used to optimize hierarchical clustering for processing large-scale data [24, 25]. Well-known cluster algorithms based on MapReduce, such as HDBSCAN [26] and MR-DBSCAN [27], reduce I/O access frequency and spatial complexity to speed up density clustering. For the affinity propagation (AP) clustering algorithm, proposed by Frey and Dueck and published in Science [28], Lu et al. [29] proposed a distributed AP clustering algorithm based on MapReduce (DisAP), which includes three MapReduce stages and achieves high performance in both scalability and accuracy. In 2012, Yang et al. [30] presented a MapReduce-based MST text clustering algorithm, which uses cloud computing technology to improve the performance of graph clustering. Yu and Dai [31] proposed a parallel fuzzy c-means algorithm based on MapReduce to improve the standard fuzzy c-means algorithm. Other parallel clustering algorithms based on cloud computing can be found in [32–34].

Even though these algorithms perform their job well, they all focus on crisp clustering. The paper uses cloud computing technology to accelerate the cluster speed of WPCM by designing a distributed WPCM algorithm (DWPCM) based on MapReduce to produce the possibilistic clustering for big data in real time.

3. Weighted PCM Algorithm

3.1. PCM Based on Partial Distance for Clustering Incomplete Data

In this subsection, the paper applies partial distance to PCM for clustering incomplete data set. Partial distance is used to calculate the distance between an object $x_{k}$ and the ith cluster center $v_{i}$ as follows:

\begin{matrix} P D_{i k} = \frac{m}{I_{k}} \sqrt{\sum_{j = 1}^{m} {(x_{k j} - v_{i j})}^{2} I_{k j}}, \\ I_{k j} = \begin{cases} 0, & \text{if } x_{k j} = * \\ 1, & \text{otherwise} \end{cases} \end{matrix}

(8a)

\begin{matrix} \text{for } 1 \leq j \leq m, \; 1 \leq k \leq n; \\ I_{k} = \sum_{j = 1}^{m} I_{k j} . \end{matrix}

(8b)

From (8a) and (8b), partial distance makes full use of attribute information of both complete data and incomplete data to calculate the distance between two objects.
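A small Python sketch of (8a) and (8b) may make the scaling explicit. It assumes, for illustration only, that a missing value (the `*` in (8a)) is represented by `None`; the function name `partial_distance` is hypothetical.

```python
import math

def partial_distance(x, v):
    """Partial distance (8a) between object x and cluster center v.

    Missing attribute values in x are represented by None (an assumption
    of this sketch). Only observed attributes enter the sum, and the
    result is scaled by m / I_k, where m is the number of attributes and
    I_k the number of observed ones, as in (8b).
    """
    m = len(x)
    observed = [(a - b) ** 2 for a, b in zip(x, v) if a is not None]
    if not observed:
        raise ValueError("object has no observed attributes")
    return (m / len(observed)) * math.sqrt(sum(observed))
```

For a complete object the factor $m / I_{k}$ equals 1 and the expression reduces to the ordinary Euclidean distance.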

The PCM algorithm based on partial distance (PDPCM) is obtained by making two modifications of PCM: (1) calculating $d_{i j}$ for incomplete data according to (8a) and (8b) and (2) updating the cluster centers with

\begin{matrix} v_{i j} = \frac{\sum_{k = 1}^{n} u_{i k}^{m} I_{k j} x_{k j}}{\sum_{k = 1}^{n} u_{i k}^{m} I_{k j}} . \end{matrix}

(9)

This algorithm enjoys all the standard convergence properties of PCM because it is an instance of alternating optimization [7].
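The center update for incomplete data can be sketched as below. The sketch assumes the denominator sums $u_{i k}^{m} I_{k j}$ over the objects whose jth attribute is observed, as in the partial distance strategy of [14], and again uses `None` for a missing value; `update_centers_pds` is a hypothetical name.

```python
def update_centers_pds(X, U, m):
    """Update each coordinate v_ij from the objects that observe attribute j.

    X is a list of n objects (lists with None for missing values), U is a
    c x n membership matrix, and m is the fuzzification constant.
    Raises ZeroDivisionError if some attribute is missing in every object
    with nonzero membership.
    """
    dim = len(X[0])
    centers = []
    for u_i in U:
        center = []
        for j in range(dim):
            num = 0.0
            den = 0.0
            for u, x in zip(u_i, X):
                if x[j] is not None:  # I_kj = 1
                    g = u ** m
                    num += g * x[j]
                    den += g
            center.append(num / den)
        centers.append(center)
    return centers
```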

For an m-dimensional incomplete data set, the procedure of the PCM algorithm for clustering incomplete data based on partial distance can be described as follows.

Step 1.

Choose m, c, and $ε > 0$ and then initialize the membership matrix $U^{(0)}$ .

Step 2.

Update cluster centers using (9).

Step 3.

Estimate $η_{i}$ using (5).

Step 4.

Update membership matrix U using (3).

Step 5.

If ${∥ u_{i j} - u_{i j}^{'} ∥}^{2} \leq ε$ , where $u_{i j}^{'}$ denotes the membership value from the previous iteration, stop; otherwise return to Step 2.

3.2. Weighted PCM Algorithm

Even though the PCM algorithm based on partial distance given earlier can cluster incomplete data sets, it is difficult to produce a good partition due to the effect of incomplete objects. The paper proposes a weighted PCM algorithm (WPCM), which assigns low weight values to incomplete objects for reducing the corruption of incomplete objects on the cluster process.

To minimize the objective function, WPCM updates the membership values and cluster centers using the following equations:

\begin{matrix} η_{i} = \frac{\sum_{j = 1}^{n} w_{j} u_{i j}^{m} d_{i j}^{2}}{\sum_{j = 1}^{n} w_{j} u_{i j}^{m}}, \end{matrix}

(10)

\begin{matrix} u_{i j} = \frac{w_{j}}{1 + {(d_{i j}^{2} / η_{i})}^{1 / (m - 1)}}, \end{matrix}

(11)

\begin{matrix} v_{i} = \frac{\sum_{j = 1}^{n} w_{j} u_{i j}^{m} x_{j}}{\sum_{j = 1}^{n} w_{j} u_{i j}^{m}} . \end{matrix}

(12)

The $w_{j}$ weights can be viewed as strength terms describing the strength of membership of $x_{j}$ in any cluster [16]. Recently, a large number of algorithms have been proposed to determine the weight of each object. Perhaps the most representative method for detecting outliers and reducing their effect on the cluster process in PCM was proposed by Schneider [16]; it determines the weight of each object according to the degree to which its feature vector belongs to any cluster. The equation for the weight values is

\begin{matrix} s_{j} = \sum_{k = 1}^{c} e^{- α {∥ x_{j} - v_{k} ∥}^{2}}, j = 1,2, \dots, n, \end{matrix}

(13)

where $α > 0$ is a suitably chosen constant.

Equation (13) can detect the outlier effectively. However, it cannot reduce the effect of incomplete objects on the cluster process. The paper redefines the weights in

\begin{matrix} w_{j} = {(1 - \frac{l_{j}}{m})}^{t} \sum_{k = 1}^{c} e^{- α {∥ x_{j} - v_{k} ∥}^{2}}, j = 1,2, \dots, n, \end{matrix}

(14)

where m is the number of features of the data object, t represents the iteration number, and $l_{j}$ is the number of missing feature values of the data object $x_{j}$ . Equation (14) reduces the corruption caused by incomplete objects in the cluster process through the coefficient $(1 - l_{j} / m)$ . A large value of $l_{j}$ , which indicates that $x_{j}$ has many missing feature values, results in a low weight value. In (14), t increases with the number of cluster iterations, which can accelerate the convergence of the clustering process.
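The weight in (14) can be illustrated with a few lines of Python. The paper does not state how $∥ x_{j} - v_{k} ∥^{2}$ is evaluated when $x_{j}$ has missing values, so this sketch assumes the sum runs over observed attributes only; the `None` marker and the name `wpcm_weight` are likewise assumptions.

```python
import math

def wpcm_weight(x, centers, t, alpha=1.0):
    """Weight of object x at iteration t, following (14).

    x may contain None for missing attributes. The squared distance to
    each center uses observed attributes only (assumption of this sketch).
    """
    m = len(x)
    missing = sum(1 for a in x if a is None)       # l_j
    completeness = (1.0 - missing / m) ** t        # (1 - l_j / m)^t
    typicality = 0.0
    for v in centers:
        d2 = sum((a - b) ** 2 for a, b in zip(x, v) if a is not None)
        typicality += math.exp(-alpha * d2)
    return completeness * typicality
```

For a complete object the first factor is 1 and the weight reduces to (13); for an object with half of its attributes missing, the factor is $0.5^{t}$ and shrinks quickly as t grows.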

The steps of the WPCM algorithm are outlined as follows.

Step 1.

Choose m, c, and $ε > 0$ and then initialize the cluster centers $V^{(0)}$ and the membership matrix $U^{(0)}$ .

Step 2.

Calculate the weight values using (14).

Step 3.

Estimate $η_{i}$ using (10).

Step 4.

Update membership matrix U using (11).

Step 5.

Update the cluster centers V using (12).

Step 6.

If ${∥ u_{i j} - u_{i j}^{'} ∥}^{2} \leq ε$ , where $u_{i j}^{'}$ denotes the membership value from the previous iteration, stop; otherwise return to Step 2.
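One pass through Steps 2 to 5 can be sketched as a single Python function. Several choices here are assumptions rather than details given in the paper: missing values are `None`, $d_{i j}$ is the partial distance of (8a), the membership update uses $d_{i j}^{2}$, the center update (12) skips unobserved attributes, the norm in (14) uses observed attributes only, and a small floor keeps $η_{i}$ positive. The name `wpcm_step` and the letter `f` for the number of features (to avoid clashing with the fuzzifier m) are hypothetical.

```python
import math

def sq_partial_distance(x, v):
    """Square of the partial distance (8a); x must have an observed value."""
    obs = [(a - b) ** 2 for a, b in zip(x, v) if a is not None]
    return ((len(x) / len(obs)) * math.sqrt(sum(obs))) ** 2

def wpcm_step(X, V, U, t, m=1.75, alpha=1.0):
    """One WPCM iteration; returns (U_new, V_new).

    X: n objects with None for missing values, V: c centers,
    U: c x n memberships from the previous iteration, t: iteration number.
    alpha should suit the data scale, otherwise all weights can
    underflow to zero.
    """
    c, n, f = len(V), len(X), len(X[0])
    # Step 2: weights from (14).
    W = []
    for x in X:
        missing = sum(1 for a in x if a is None)
        s = sum(math.exp(-alpha * sum((a - b) ** 2
                                      for a, b in zip(x, v) if a is not None))
                for v in V)
        W.append((1.0 - missing / f) ** t * s)
    D2 = [[sq_partial_distance(x, v) for x in X] for v in V]
    U_new, V_new = [], []
    for i in range(c):
        g = [W[j] * U[i][j] ** m for j in range(n)]
        # Step 3: eta_i from (10), floored to stay positive.
        eta = max(sum(g[j] * D2[i][j] for j in range(n)) / sum(g), 1e-12)
        # Step 4: memberships from (11).
        u_i = [W[j] / (1.0 + (D2[i][j] / eta) ** (1.0 / (m - 1.0)))
               for j in range(n)]
        # Step 5: centers from (12), over observed attributes only.
        h = [W[j] * u_i[j] ** m for j in range(n)]
        v_i = []
        for d in range(f):
            rows = [j for j in range(n) if X[j][d] is not None]
            v_i.append(sum(h[j] * X[j][d] for j in rows)
                       / sum(h[j] for j in rows))
        U_new.append(u_i)
        V_new.append(v_i)
    return U_new, V_new
```

Calling `wpcm_step` repeatedly with t = 1, 2, ... and comparing successive membership matrices against ε gives the loop described in Step 6.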

3.3. Time Complexity

In this subsection, we discuss the time complexity of WPCM. All operations are counted as unit costs. We do not assume time economies that might be realized by special programming tricks or properties of the equations involved. We use the following notation in our discussion:

i is the number of iterations of WPCM over the full data set;

n is the number of data objects;

f is the number of dimensions;

c is the number of clusters.

From the cluster process of the WPCM algorithm, the time complexity is dominated by Steps 4 and 5. Calculating the membership matrix using (11) requires $O (n c f)$ operations per iteration, as does updating the cluster centers, resulting in a total time complexity of $O (i n c f)$ for WPCM.

4. Distributed Weighted PCM Algorithm Based on MapReduce

In Section 3, we proposed a weighted PCM algorithm based on partial distance, which can cluster incomplete data effectively. However, WPCM has a low efficiency for clustering incomplete big sensor data owing to the huge amount of data and its high time complexity.

To accelerate the cluster speed of WPCM, the paper proposes a distributed WPCM algorithm (DWPCM) based on MapReduce in this section.

From the steps of WPCM, there are two major operations: calculating the degree of membership $u_{i j}$ and calculating the clustering centers $v_{i}$ .

In the map phase, the Map function is designed to calculate the degree of membership $u_{i j}$ .

To reduce the communication cost of the distributed algorithm, the paper partitions the membership matrix into p blocks by columns, each block with $n / p$ columns, where n is the number of the data objects and p is the number of the data nodes in a cloud platform. After partition, the paper puts each block of the membership matrix in a data node to calculate.

In order to update cluster centers in parallel, two parameters, $ξ_{i}^{(t)}$ and $λ_{i}^{(t)}$ , are introduced, where t represents the serial number of data node. After calculating the membership $u_{i j}$ , the Map function calculates $ξ_{i}^{(t)}$ and $λ_{i}^{(t)}$ using

\begin{matrix} ξ_{i}^{(t)} = \sum_{k = 1}^{n / p} w_{k} u_{i k}^{m} x_{k}, \quad i = 1,2, \dots, c, \end{matrix}

(15)

\begin{matrix} λ_{i}^{(t)} = \sum_{k = 1}^{n / p} w_{k} u_{i k}^{m}, \quad i = 1,2, \dots, c . \end{matrix}

(16)

Finally, the Map function outputs c pairs $\langle key^{'}, value^{'} \rangle$ , where c is the number of classes, $key^{'}$ represents the identifier of the class, and $value^{'}$ is a vector that consists of $ξ_{key^{'}}^{(t)}$ and $λ_{key^{'}}^{(t)}$ .

In the reduce phase, the Reduce function is designed to calculate the clustering centers $v_{i}$ .

The input of the Reduce function is a pair $\langle key^{'}, list \rangle$ , where $key^{'}$ is the identifier of the class and $list$ includes all $value^{'}$ entries with the same $key^{'}$ obtained from the Map function. The Reduce function is responsible for computing the cluster centers according to

\begin{matrix} v_{i} = \frac{\sum_{t = 1}^{p} ξ_{i}^{(t)}}{\sum_{t = 1}^{p} λ_{i}^{(t)}}, \end{matrix}

(17)

where p is the number of data nodes and i is the identifier of the class, which has the same meaning as $key^{'}$ .
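The Map and Reduce roles described above can be imitated in ordinary Python to show how the partial sums of (15) and (16) combine in (17). This is a sequential, in-process sketch, not Hadoop code: the names `map_block`, `reduce_center`, and `run_iteration` are hypothetical, the memberships are taken as already computed inside each block, and objects are assumed complete so that $x_{k}$ can be summed directly.

```python
from collections import defaultdict

def map_block(block, m):
    """Emit one (cluster id, (xi, lam)) pair per cluster for a data block.

    block is a list of (x, w, u) triples: the object, its weight, and its
    membership vector u, where u[i] is the membership in cluster i.
    """
    c = len(block[0][2])
    f = len(block[0][0])
    pairs = []
    for i in range(c):
        xi = [0.0] * f   # partial numerator, equation (15)
        lam = 0.0        # partial denominator, equation (16)
        for x, w, u in block:
            g = w * u[i] ** m
            lam += g
            for d in range(f):
                xi[d] += g * x[d]
        pairs.append((i, (xi, lam)))
    return pairs

def reduce_center(key, values):
    """Combine the partial sums for one cluster into its center, as in (17)."""
    f = len(values[0][0])
    xi_total = [sum(v[0][d] for v in values) for d in range(f)]
    lam_total = sum(v[1] for v in values)
    return key, [s / lam_total for s in xi_total]

def run_iteration(blocks, m):
    """Group map outputs by key and reduce them; blocks stand in for nodes."""
    grouped = defaultdict(list)
    for block in blocks:
        for key, value in map_block(block, m):
            grouped[key].append(value)
    return dict(reduce_center(k, vs) for k, vs in grouped.items())
```

Each node sends only c small vectors per iteration, regardless of how many objects its block holds, which is why the partial sums are introduced.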

As in Section 3.3, we now discuss the time complexity of DWPCM, using the same notation. The time complexity of DWPCM is approximately $O (i n c f / p)$ , where p is the number of data nodes in the cloud computing platform. Note that DWPCM sends the cluster centers and the weight values to all the data nodes in each iteration, which may cause communication overhead. However, the communication takes significantly less time than the computation in the cluster process, especially in a centralized cloud computing platform where all cloud computing nodes are in the same location. We therefore ignore the communication overhead when estimating time complexity.

5. Experiments

In order to evaluate the efficiency and effectiveness of the proposed algorithms, we perform the algorithms on the two real data sets. The experimental setup and dataset are described first, followed by the results.

5.1. Experimental Setup and Dataset

The experimental environment consists of 20 computers as cloud computing nodes, each of which has an Intel Core i7 processor with 3.2 GHz speed, 8 GB RAM, and a 2 TB hard drive.

We apply the algorithms on two real data sets in the experiments. The first data set is an extension of “Gas Sensor Array Drift Dataset at Different Concentrations Data Set” which is available in UCI Machine Learning Repository [35, 36]. This data set contains 3 × 10⁹ objects with 129 numerical attributes, called eGSAD. The second data set consists of 2 × 10¹⁰ objects sampled from the smart WSN lab, called sWSN, whose size is about 1 TB. The parameters used for the proposed algorithms are $ε = 10^{- 3}$ and $m = 1.75$ , which can help to produce a good cluster result [37].

We first artificially create missing values in the data sets for simulating incomplete data sets and then cluster them using the proposed algorithms. The performance of the proposed algorithms is evaluated by comparing their cluster results to the result of PCM for clustering original data sets from both the efficiency and effectiveness points of view. While the efficiency concerns the time required to cluster the data sets, the effectiveness is related to the cluster accuracy.

Since the cluster performance depends on the amount of missing values, we artificially create six missing ratios, in which 1%, 3%, 5%, 10%, 15%, and 20% of objects have missing values. For every missing ratio, we generate 5 different incomplete data sets for eGSAD and sWSN. Specifically, for every missing ratio, any two of these data sets can differ in which values are missing.
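The construction of incomplete data sets can be sketched as follows. The paper specifies only the fraction of objects that have missing values, so this sketch assumes each selected object loses exactly one randomly chosen attribute; the function name `make_incomplete`, the `seed` parameter, and the `None` marker are hypothetical.

```python
import random

def make_incomplete(X, ratio, seed=0):
    """Return a copy of X in which a fraction `ratio` of objects has a
    missing value (None) in one randomly chosen attribute.

    Different seeds give different incomplete data sets for the same
    missing ratio. The input X is left unchanged.
    """
    rng = random.Random(seed)
    n, f = len(X), len(X[0])
    chosen = set(rng.sample(range(n), round(ratio * n)))
    result = []
    for j, x in enumerate(X):
        row = list(x)
        if j in chosen:
            row[rng.randrange(f)] = None
        result.append(row)
    return result
```

Generating the five data sets for one missing ratio then amounts to calling the function with five different seeds.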

5.2. Experimental Results on Cluster Accuracy

In order to assess the effectiveness of WPCM, two well-known evaluation criteria, $E_{*}$ and Adjusted Rand Index (ARI), are used in the experiment [37, 38].

The first evaluation criterion, $E_{*}$ , is used to assess the error between ideal cluster centers and cluster centers produced by a specific algorithm, which is calculated according to

\begin{matrix} E_{*} = \sqrt{\sum_{i = 1}^{c} {∥ v_{ideal}^{i} - v_{*}^{i} ∥}^{2}}, \end{matrix}

(18)

where $v_{ideal}^{i}$ represents the ith ideal cluster center and $v_{*}^{i}$ denotes the ith cluster center produced by a specific algorithm *. A lower value of $E_{*}$ indicates that the algorithm produces more accurate cluster centers.
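Equation (18) translates directly into a short Python function. The sketch assumes that the ideal and produced centers are already matched by index, which the text does not discuss; `center_error` is a hypothetical name.

```python
import math

def center_error(ideal, produced):
    """E_* from (18): root of the summed squared distances between
    corresponding ideal and produced cluster centers."""
    total = 0.0
    for v_ideal, v_prod in zip(ideal, produced):
        total += sum((a - b) ** 2 for a, b in zip(v_ideal, v_prod))
    return math.sqrt(total)
```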

The other evaluation criterion, $ARI (U, U^{'})$ , is used to measure the agreement between two possibilistic partitions of a set of objects, where U represents the ground truth labels for the objects in the data set and $U^{'}$ denotes a partition produced by a specific algorithm. A higher value of $ARI (U, U^{'})$ indicates that the algorithm produces a better cluster result.

To reduce the variation in the results from trial to trial, Tables 1 to 4 present the average values of $E_{*}$ and $ARI$ obtained over 10 trials on the incomplete eGSAD and sWSN data sets. The same incomplete data set is used in each trial for each of the three approaches so that the results can be compared fairly.

Table 1

Averaged results of 10 trials on the eGSAD set in terms of $E_{*}$ .

Missing ratio	PDPCM	WPCM	DWPCM
1%	36.24	13.09	13.09
3%	42.98	17.26	17.26
5%	48.85	20.12	20.12
10%	51.07	24.78	24.78
15%	59.69	30.24	30.24
20%	66.94	32.58	32.58

Table 2

Averaged results of 10 trials on the sWSN set in terms of $E_{*}$ .

Missing ratio	PDPCM	WPCM	DWPCM
1%	1.19	0.54	0.54
3%	1.52	0.69	0.69
5%	1.82	0.92	0.92
10%	2.07	1.15	1.15
15%	2.48	1.27	1.27
20%	2.97	1.54	1.54

Tables 1 and 2 present the cluster accuracy of PCM based on partial distance (PDPCM), WPCM, and DWPCM on the two data sets in terms of $E_{*}$ for the 6 missing ratios.

From Tables 1 and 2, as the missing ratio increases, the average values of $E_{*}$ of the three algorithms increase, which indicates that the cluster accuracy is affected by the missing ratio. In terms of $E_{*}$ , WPCM always performs better than PDPCM because the average $E_{*}$ value of WPCM is lower than that of PDPCM for all 6 missing ratios, which demonstrates that the cluster prototypes obtained by WPCM are closer to the actual ones. DWPCM produces the same result as WPCM in terms of $E_{*}$ because the two algorithms use the same methods to calculate the weight values, the cluster prototypes, and the membership matrix.

To calculate the $ARI$ , we first harden the possibilistic partitions by setting the maximum element in each column of U to 1 and all else to 0. Tables 3 and 4 show the average values of $ARI (U, U^{'})$ obtained by PDPCM, WPCM, and DWPCM.
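The hardening step and the Adjusted Rand Index can be sketched in standard-library Python. Instead of a 0/1 matrix, `harden` returns one cluster label per object, which carries the same information; both function names are hypothetical, and the ARI uses the usual pair-counting formula over the contingency table of the two labelings.

```python
from collections import Counter
from math import comb

def harden(U):
    """Assign each object (column of U) to its maximum-membership cluster."""
    c, n = len(U), len(U[0])
    return [max(range(c), key=lambda i: U[i][j]) for j in range(n)]

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two hard labelings of the same objects.

    Requires at least two objects.
    """
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(v, 2) for v in contingency.values())
    sum_rows = sum(comb(v, 2) for v in Counter(labels_true).values())
    sum_cols = sum(comb(v, 2) for v in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index is invariant to how the clusters are numbered, so the labels from `harden` do not need to be aligned with the ground truth labels first.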

Table 3

Averaged results of 10 trials on eGSAD in terms of ARI.

Missing ratio	PDPCM	WPCM	DWPCM
1%	0.9561	0.9893	0.9893
3%	0.9347	0.9754	0.9754
5%	0.9021	0.9556	0.9556
10%	0.8684	0.9331	0.9331
15%	0.8249	0.9092	0.9092
20%	0.7627	0.8861	0.8861

Table 4

Averaged results of 10 trials on sWSN in terms of ARI.

Missing ratio	PDPCM	WPCM	DWPCM
1%	0.9236	0.9653	0.9653
3%	0.8742	0.9238	0.9238
5%	0.8426	0.8958	0.8958
10%	0.8182	0.8630	0.8630
15%	0.7587	0.8204	0.8204
20%	0.7121	0.7862	0.7862

From the results shown in Tables 3 and 4, WPCM produces better partitions than PDPCM for 6 different missing ratios of the two data sets in terms of ARI. DWPCM still produces the same result as WPCM based on ARI.

5.3. Experimental Results on Execution Time

We use execution time and scalability to evaluate the efficiency of DWPCM. The average execution time of PDPCM, WPCM, and DWPCM on the two data sets for different number of objects is shown in Figures 1 and 2.

Figure 1

Average execution time on eGSAD.

Figure 2

Average execution time on sWSN.

From Figures 1 and 2, even though the average execution time of the three algorithms increases with the number of objects, DWPCM takes the least time of the three algorithms on both data sets. Especially when the data set is very big, the execution time required by DWPCM is significantly less than that of the other two algorithms, which demonstrates that DWPCM performs most efficiently for clustering big sensor data.

In order to test the scalability of DWPCM, we perform the algorithm for clustering the two data sets in the different cloud computing platforms, in which there are 5 nodes, 10 nodes, 15 nodes, and 20 nodes, respectively. The result is shown in Figure 3.

Figure 3

Computation speed of DWPCM.

From Figure 3, the execution time of DWPCM for clustering the two data sets decreases gradually as the number of data nodes in the cloud computing platform increases, which demonstrates that adding nodes can significantly improve the system capacity. Therefore, DWPCM has good scalability, especially when the data set is big.

6. Conclusion and Future Work

The paper proposes a distributed weighted PCM algorithm for clustering incomplete big sensor data. The proposed algorithm applies partial distance strategy to PCM (PDPCM) for calculating the distance between any two objects in the incomplete data set. Further, based on PDPCM, the paper designs a weighted PCM algorithm (WPCM) to reduce the corruption of incomplete objects on the cluster process. Another unique property of the proposed algorithm is the use of the cloud computing technology. Cloud computing is used to optimize WPCM to provide a significant computation speed, which is very important for big sensor data real-time clustering and analysis.

The experiments demonstrate that WPCM produces a good cluster result based on both evaluation criteria, namely, $E_{*}$ and $ARI$ . As for efficiency, DWPCM takes significantly less time than WPCM for clustering incomplete big sensor data. Note that DWPCM produces the same cluster result as WPCM in terms of $E_{*}$ and $ARI$ , which demonstrates that DWPCM improves the cluster efficiency without reducing the cluster accuracy.

In future work, we will investigate further improvements of DWPCM to increase the effectiveness and efficiency of clustering big sensor data with many missing values. Additionally, because big sensor data contain many semistructured and unstructured data, we plan to modify DWPCM to cluster these two types of incomplete data into appropriate groups.

Footnotes

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by Project U1301253 of NSFC and Project 201202032 of Liaoning Provincial Natural Science Foundation of China.

References

Feki

M. A.

Kawsar

Boussard

Trappeniers

The internet of things: the next technological revolution

Computer 2013 46 2 24 25

10.1109/MC.2013.63

Atzori

Iera

Morabito

The internet of things: a survey

Computer Networks 2010 54 15 2787 2805

2-s2.0-77956877124

10.1016/j.comnet.2010.05.010

Liu

Chen

Nonthreshold-based event detection for 3D environment monitoring in sensor networks

IEEE Transactions on Knowledge and Data Engineering 2008 20 12 1699 1711

2-s2.0-55949105799

10.1109/TKDE.2008.114

Brass

Bounds on coverage and target detection capabilities for models of networks of mobile sensors

ACM Transactions on Sensor Networks 2007 3 2, article 9

2-s2.0-34250312116

10.1145/1240226.1240229

1240229

Tan

Xing

Liu

Wang

Jia

Exploiting data fusion to improve the coverage of wireless sensor networks

IEEE/ACM Transactions on Networking 2012 20 2 450 462

2-s2.0-84859964364

10.1109/TNET.2011.2164620

Zhu

Gong-Qing

Ding

Data mining with big data

IEEE Transactions on Knowledge and Data Engineering 2014 26 1 97 107

10.1109/TKDE.2013.109

Krishnapuram

Keller

J. M.

A possibilistic approach to clustering

IEEE Transactions on Fuzzy Systems 1993 1 2 98 110

2-s2.0-0027595430

10.1109/91.227387

Zhang

J.-S.

Leung

Y.-W.

Improved possibilistic C-means clustering algorithms

IEEE Transactions on Fuzzy Systems 2004 12 2 209 217

2-s2.0-1942532748

10.1109/TFUZZ.2004.825079

Yang

M.-S.

Lai

C.-Y.

A robust automatic merging possibilistic clustering method

IEEE Transactions on Fuzzy Systems 2011 19 1 26 41

2-s2.0-79551622290

10.1109/TFUZZ.2010.2077640

10.

Barni

Cappellini

Mecocci

Comments on a possibilistic approach to clustering

IEEE Transactions on Fuzzy Systems 1996 4 3 393 396

10.1109/91.531780

11.

Pal

N. R.

Pal

Keller

J. M.

Bezdek

J. C.

A possibilistic fuzzy C-means clustering algorithm

IEEE Transactions on Fuzzy Systems 2005 13 4 517 530

2-s2.0-26844532803

10.1109/TFUZZ.2004.840099

12.

Xie

Wang

Chung

F. L.

An enhanced possibilistic C-means clustering algorithm EPCM

Soft Computing—A Fusion of Foundations, Methodologies and Applications 2008 12 6 593 611

2-s2.0-38749134116

10.1007/s00500-007-0231-6

13.

Zhang

A fuzzy C-means clustering algorithm based on nearest-neighbor intervals for incomplete data

Expert Systems with Applications 2010 37 10 6942 6947

2-s2.0-78649930585

10.1016/j.eswa.2010.03.028

14.

Hathaway

R. J.

Bezdek

J. C.

Fuzzy C-means clustering of incomplete data

IEEE Transactions on Systems, Man, and Cybernetics, B: Cybernetics 2001 31 5 735 744

2-s2.0-0035481296

10.1109/3477.956035

15.

Liu

Xia

S. X.

Zhou

A sample-weighted possibilistic fuzzy clustering algorithm

Acta Electronica Sinica 2012 40 2 371 375

16.

Schneider

Weighted possibilistic C-means clustering algorithms

IEEE Transactions on Fuzzy Systems 2000 1 176 180

10.1109/FUZZY.2000.838654

17.

Filippone

Masulli

Rovetta

Applying the possibilistic C-means algorithm in kernel-induced spaces

IEEE Transactions on Fuzzy Systems 2010 18 3 572 584

2-s2.0-77953110999

10.1109/TFUZZ.2010.2043440

18.

Armbrust

Fox

Griffith

Joseph

A. D.

Katz

Konwinski

Lee

Patterson

Rabkin

Stoica

Zaharia

A view of cloud computing

Communications of the ACM 2010 53 4 50 58

2-s2.0-77950347409

10.1145/1721654.1721672

19.

Zhang

Chen

Yang

L. T.

A nodes scheduling model based on Markov chain prediction for big streaming data analysis

International Journal of Communication Systems 2014

10.1002/dac.2779

20.

Dean

Ghemawat

MapReduce: simplified data processing on large clusters

Communications of the ACM 2008 51 1 107 113

2-s2.0-37549003336

10.1145/1327452.1327492

21. Zhao. Parallel k-means clustering based on MapReduce. In: Proceedings of the 1st International Conference on Cloud Computing; 2009; Berlin, Germany. Springer; pp. 674-679. doi:10.1007/978-3-642-10665-1_71.

22. Ene, Moseley. Fast clustering using MapReduce. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 2011; New York, NY, USA. ACM; pp. 681-689. doi:10.1145/2020408.2020515. Scopus EID: 2-s2.0-80052666000.

23. Bahmani, Moseley, Vattani, Kumar, Vassilvitskii. Scalable K-means++. Proceedings of the VLDB Endowment. 2012;5(7):622-633.

24. Sun, Shu, Fang. An efficient hierarchical clustering method for large datasets with map-reduce. In: Proceedings of the International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT '09); December 2009; Higashihiroshima, Japan. pp. 494-499. doi:10.1109/PDCAT.2009.46. Scopus EID: 2-s2.0-77951019939.

25. Gao, Jiang, She. A new agglomerative hierarchical clustering algorithm implementation based on the map reduce framework. International Journal of Digital Content Technology and Its Applications. 2010;4(3):95-100. doi:10.4156/jdcta.vol4.issue3.9. Scopus EID: 2-s2.0-78651524475.

26. Research on clustering algorithm and its parallelization strategy. In: Proceedings of the International Conference on Computational and Information Sciences (ICCIS '11); October 2011; Chengdu, China. IEEE; pp. 325-328. doi:10.1109/ICCIS.2011.223. Scopus EID: 2-s2.0-83755185529.

27. Tan, Luo. MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In: Proceedings of the 17th IEEE International Conference on Parallel and Distributed Systems; December 2011; Tainan, Taiwan. pp. 473-480. doi:10.1109/ICPADS.2011.83.

28. Frey B. J., Dueck. Clustering by passing messages between data points. Science. 2007;315(5814):972-976. doi:10.1126/science.1136800. Scopus EID: 2-s2.0-33847172327.

29. Chenyang, Baogang, Chunhui, Zhenchao. Distributed affinity propagation clustering based on MapReduce. Journal of Computer Research and Development. 2012;49(8):1762-1772.

30. Yang. Research and application of MapReduce-based MST text clustering algorithm. In: Proceedings of the IEEE International Conference on Information Science and Technology (ICIST '12); March 2012. IEEE; pp. 753-757.

31. Dai. Parallel fuzzy C-means algorithm based on MapReduce. Computer Engineering and Applications. 2013;49(14):133-151.

32. Papadimitriou, Sun. DisCo: distributed co-clustering with map-reduce: a case study towards petabyte-scale end-to-end mining. In: Proceedings of the 8th IEEE International Conference on Data Mining (ICDM '08); December 2008; Pisa, Italy. IEEE; pp. 512-521. doi:10.1109/ICDM.2008.142. Scopus EID: 2-s2.0-67149126890.

33. Cordeiro R. L. F., Traina Jr., Traina A. J. M., López, Kang, Faloutsos. Clustering very large multi-dimensional datasets with MapReduce. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 2011. ACM; pp. 690-698. doi:10.1145/2020408.2020516. Scopus EID: 2-s2.0-80052686089.

34. Grossman. dSimpleGraph: a novel distributed clustering algorithm for exploring very large scale unknown data sets. In: Proceedings of the 10th IEEE International Conference on Data Mining Workshops (ICDMW '10); December 2010; Washington, DC, USA. pp. 162-169. doi:10.1109/ICDMW.2010.12. Scopus EID: 2-s2.0-79951763568.

35. Vergara, Vembu, Ayhan, Ryan M. A., Homer M. L., Huerta. Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical. 2012;166-167:320-329. doi:10.1016/j.snb.2012.01.074.

36. Rodriguez-Lujan, Fonollosa, Vergara, Homer, Huerta. On the calibration of sensor arrays for pattern recognition using the minimal number of experiments. Chemometrics and Intelligent Laboratory Systems. 2014;130:123-134. doi:10.1016/j.chemolab.2013.10.012.

37. Havens T. C., Bezdek J. C., Leckie, Hall L. O., Palaniswami. Fuzzy C-means algorithms for very large data. IEEE Transactions on Fuzzy Systems. 2012;20(6):1130-1146. doi:10.1109/TFUZZ.2012.2201485.

38. Han X. D., Xia Z. X., Liu, Zhou. Kernel-based fast improved possibilistic C-means clustering method. Computer Engineering and Applications. 2011;47(6):176-180.
