Sage Journals: Discover world-class research

Abstract

Density-based clustering for big data is critical for many modern applications ranging from Internet data processing to massive-scale moving object management. This paper proposes Cludoop algorithm, an efficient distributed density-based clustering for big data using Hadoop. First, we propose a serial clustering algorithm CluC by leveraging cell partition optimization and c-cluster to fast find clusters. CluC completes classification of the points using the relationships of connected cells around points instead of expensive completed neighbor query, which significantly reduce the number of distance calculations. Second, we propose the Cludoop, which can efficiently cluster very-large-scale data in parallel using already existing data partition on Map/Reduce platform. It employs the proposed serial clustering CluC as a plugged-in clustering on parallel mapper, along with a cell description instead of completed cell in transmission to reduce both network and I/O costs. Guided by proposed cell-based principles, we also design a Merging-Refinement-Merging 3-step framework to merge c-clusters on the overlay of assigned preclustering result on reducer. Finally, our comprehensive experimental evaluation on 10 network-connected commercial PCs, using both huge-volume real and synthetic data, demonstrates (1) the effectiveness of our algorithm in finding correct clusters with arbitrary shape and (2) the fact that our proposed algorithm exhibits better scalability and efficiency than state-of-the-art method.

1. Introduction

Clustering is to group data objects into different classes or clusters, and the objects within a cluster have high similarity, while objects in intercluster differ significantly with one another. Clustering has played a crucial role in numerous applications ranging from pattern recognition, mobile sensor networks, and moving object management to location-based service. With the rise of big data science, clustering analysis has attracted considerable interests in the big data mining, while clustering based on density is very useful to distance-based data mining in many applications with increasingly large-scale data owing to their capability to discover clusters with arbitrary shape.

Existing density-based algorithms such as DBSCAN [1], OPTICS [2], DENCLUE [3], and GDDSCAN [4] can obtain better groups of data points in the static large-scale and high-dimension databases, which were widely applied into many applications in the past decade. However applying the algorithms into current data-intensive applications is challenging due to rapidly increasing distributed stored big data. For example, the traffic data (GPS trajectories and infrared acquisition) from intelligent transportation system in Jiangsu Province reaches 6.94 billion records, and 4 TB sensor data were collected from older's healthcare monitoring in Shanghai for one week; the emerging wearable devices will further promote coming of big medical data era. The dataset released by Twitter is larger than 133 TB, increased data updated by Tencent's mobile applications reaches about 200 to 300 TB a day, and even the operational data of Yahoo reaches 5 PB. When the amount of data is this large, it is impossible to handle the data using the serial clustering methods on a single machine. Therefore, the best clustering algorithm is the one that (a) combines a scalable effective serial algorithm, (b) makes it run efficiently in distributed platform, and (c) does not need to preprocess all dataset.

For big data analysis, Map/Reduce and its open source equivalent Hadoop have attracted a lot of attentions due to its parallel way to handle massive-scale data. Hadoop is a desirable distributed computing platform based on a shared-nothing cluster architecture. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop adopts HDFS (Hadoop Distributed File System) storage structure and simple Map/Reduce programming model. HDFS provides high-throughput access to application data and initially partitions data in multiple nodes, and data is represented as <key, value> pairs. In Map and Reduce phases, Map and Reduce functions take a key-value pair as input and then output key-value pairs. The Map function is first called on different partitions of input data on each slave node in parallel. The outputs by mappers are next grouped and merged by each distinct key. Then a Reduce function is invoked for each distinct key with the list of all values sharing the key. Finally the Map/Reduce framework executes the main function on a single master machine to postprocess the outputs of Reduce functions. Depending on the applications, a pair of Map and Reduce functions can be executed once or nested multiple times. Ever since Hadoop was first introduced in 2003, Map/Reduce received great success because it allows easy development of scalable parallel application to process big data on thousands of commodity nodes, tolerating the failure nodes in the process; some works also already reflected the fact in academia (see [5]). However, finding density-based clusters from big data is a very challenging problem today.

This paper focuses on the problem of efficient, effective, and scalable density-based clustering for big data. Our method employs Hadoop as computing platform and incorporates cell partition and c-cluster optimizations into density-based clustering. Our major research works are (a) how to minimize the communication (network) cost among processing nodes, (b) how to avoid preprocessing the data, namely, reducing the I/O cost, and (c) how to extract exact density-based clusters. So this paper proposes the distributed density-based clustering using Hadoop-Cludoop method, which efficiently handles large-scale data without any preprocess.

The main contributions of this paper include the following. (1) We propose a c-cluster definition along with cell-connected observations to significantly reduce computational cost of neighbor range query. (2) We propose an efficient serial clustering algorithm CluC leveraging c-clusters and neighbor searching optimization. (3) We propose the distributed clustering framework for big data using Map/Reduce structure based on the proposed serial clustering algorithm as a plugged-in clustering on Map, along with a 3-step merging framework on Reduce. (4) We conduct comprehensive experiments on Hadoop platform deployed on 10 commodity machines to evaluate the performance of Cludoop using larger-scale real and synthetic data. The results show that our methods are both effective and efficient.

The paper is organized as follows: we first discuss related work in Section 2. In Section 3, we present preliminary notions and problem statement. Section 4 introduces theoretical ideas and presents the serial clustering algorithm. In Section 5, we then present the distributed Cludoop algorithm. In Section 6, we perform an experimental evaluation of the effectiveness, efficiency, and scalability of our algorithm. Section 7 concludes whole paper with a summary.

2. Related Work

In this section, we mainly review related work in the areas of density-based clustering, parallel variants of DBSCAN, and distributed clustering on Map/Reduce platform.

Density-Based Clustering. Density-based methods describe the clusters being the high density area of points separated from the low-density regions in the data space, so clusters exhibit arbitrary shapes in the high-density region. There are two types of density-based clustering method: algorithms based on local connectivity such as DBSCAN [1] and OPTICS [2] and algorithms based on density function such as DENCLUE [3]. DBSCAN determines a nonhierarchical, disjoint partitioning of the data into several clusters. Clusters are expanded starting at arbitrary seed points within dense areas. Objects in areas of low density are assigned to a separate noise partition. DBSCAN is robust against noise and the user does not need to specify the number of clusters in advance. OPTICS is proposed to solve parameters selection problem on density-based clustering algorithms. The paper proposes two concepts: core-distance and reachability-distance to organize points. The point objects would be ordered by reachability-distance and their core point to obtain the clustering structure, which contains hierarchical clusters under a broad range of parameter settings. OPTICS visualizes clearly the cluster structure via the ordered point list and could find arbitrary shaped clusters and overlapping clusters. DENCLUE formalizes the cluster notion by nonparametric kernel density estimation based on modelling the local influence of each data object on the feature space by a simple kernel function, for example, Gaussian. It defines a cluster as a local maximum of the probability density.

Parallel Variants of Density-Based Clustering. Xu et al. [6] propose a parallel clustering algorithm PDBSCAN for mining large distributed spatial databases. It uses the so-called shared-nothing cluster architecture, which has the main advantage that it can be scaled up to a high number of computers. However it is not fully paralleled while it still needs a single node to aggregate intermediate results. Januzaj et al. [7] depict a distributed clustering based on DBSCAN and DBDC and formed local and global two-level clustering. The local clustering is carried out independently on local data; then global clustering is done on a central server based on the transmitted representatives from local clustering. Subsequently, they design a density-based distributed clustering [8] which allows a user-defined tradeoff between clustering quality and the number of transmitted objects from the local sites to the global server site based on DBDC idea. They first order all objects located at a local site according to a quality criterion reflecting their suitability to serve as local representatives and then send the best of these representatives to a server site where they are clustered with a slightly enhanced DBSCAN algorithm to obtain high quality clusters. Dash et al. [9] propose a parallel hierarchical agglomerative clustering based on partially overlapping partitioning on shared memory multiprocessor architecture for handling nested data. The experiment shows the algorithm achieves near linear speedup. Brecheisen et al. [10] present a parallel DBSCAN on a workstation, which is parallelized by a conservative approximation of complex distance functions, based on the concept of filter merge points. The final result is derived from a global cluster connectivity graph. Böhm et al. [11] implement several data mining tasks under the highly parallel environment of GPUs, including similarity search and clustering. For density-based clustering, they design a parallel DBSCAN algorithm supported by their proposed similarity join on GPU. They then propose a massively parallel density-based clustering method [12] using GPUs by leveraging the high parallelism combined with a high bandwidth in memory transfer at low cost. Andrade et al. [13] also propose a GPU parallel version of DBSCAN, named G-DBSCAN, using graph to index point objects with less than a given distance threshold to each other. However the parallel density-based clustering can not straightforward be transferred on the Hadoop, because not only have GPU-capable parallel algorithms shared main memory but groups of the processors even share very fast memory units, while Hadoop uses a distributed file system.

Distributed Clustering on Map/Reduce. Ene et al. [14] develop partition-based clustering algorithms, k-center and k-median, running on Map/Reduce, which use sampling to decrease the data size and run in a time constant number of Map/Reduce rounds. For density-based clustering, He et al. [15] propose a parallel DBSCAN using Map/Reduce framework, MR-DBSCAN, which partitions all spatial data into different Maps by the space location in the preprocessing stage and then performs DBSCAN in each mapper and merges the bordering spaces in Reduce step. Similarly, Dai and Lin [16] also propose a Map/Reduce-based DBSCAN with a data partition, partition with Reduce boundary points (PRBP), selecting partition boundaries based on the distribution of data points. However the data partition based on data space easily causes load unbalancing due to the sparse data. Subsequently, He et al. [17] also propose a load balancing mechanism based on a cost-based spatial partitioning for heavily skewed data. However above methods expend I/O cost due to preprocessing, especially for the distributed stored big data. This is also one of the key points which we resolve in this paper. Cordeiro et al. [18] present a clustering solution to find subspace for high-dimensional data using Map/Reduce; however they focus on the tradeoff problem between the I/O cost and network cost and aim to dynamically choose the best strategy. These techniques did not explore the optimization opportunities enabled by the nature insights of serial density-based clusters; this is exactly what our work does for delivering highly scalable distributed solutions along with an efficient optimized serial density-based clustering.

3. Preliminaries and Problem Statement

3.1. Definitions and Theorems

This section introduces the definitions of related terms based on the notion of connected density in [1]. $E p s$ and $M i n P t s$ are the neighbor range parameter and the minimum number of neighbor points, respectively.

Definition 1 ( $E p s$ -range).

The $E p s$ -range of a point p is the circular area with radius $E p s$ centered on the specific point p.

Definition 2 ( $E p s$ -neighbor).

All points included in $E p s$ -range of the point p are called $E p s$ -neighbor of p, denoted by $N_{E p s} (p)$ .

Therefore, $N_{E p s} (p)$ is a set of data points. Let $| N_{E p s} (p) |$ denote the cardinality of $N_{E p s} (p)$ . Obviously, the neighbor is symmetric for pairs of points.

Definition 3 (core point).

A point p is called a core point if $| N_{E p s} (p) | \geq M i n P t s$ .

Definition 4 (border point).

For a point p, if $| N_{E p s} (p) | < M i n P t s$ and $\exists q (q \in N_{E p s} (p))$ is a core point, the point p is a border point of the cluster including point q.

Definition 5 (isolated point).

A point p is classified as an isolated point if it is neither a core point nor a border point.

Note that isolated points can be considered as either anomalous points or noise.

Definition 6 (directly density reachable).

A point p is directly density reachable from another point q, if the point p is one of $E p s$ -neighbors of q, and q is a core point.

The $E p s$ -neighbors of a core point p are directly density reachable from p, and the border points are directly density reachable from their $E p s$ -neighbors which are core points.

Definition 7 (density reachable).

A point p is density reachable from a point q, if there is a chain of points $p_{1}, p_{2}, \dots, p_{n}$ , $p_{1} \leftarrow p$ , $p_{n} \leftarrow q$ such that $p_{i - 1}$ is directly density reachable from $p_{i}$ .

Definition 8 (density connected).

A point p is density connected to a point q, if there exists a point o such that both p and q are density reachable from o.

Definition 9 (density-based cluster (d-cluster)).

Let D be points set. A density-based cluster is a nonempty subset of D including at least a core point and all points which are density reachable from the core point.

Note that all core points would be classified in density-based clusters.

Lemma 10.

If a core point p belongs to two d-clusters $C_{1}$ and $C_{2}$ , respectively, the $C_{1}$ and $C_{2}$ can be merged into one d-cluster.

Proof.

Since point $p \in C_{1}$ and p is a core point, then $N_{E p s} (p) \in C_{1}$ . According to Definition 9, the d-cluster $C_{1}$ can be expanded to $C_{1} = {o | o \in D$ and o is density reachable from $p}$ . Similarly, due to $p \in C_{1}$ , $C_{2}$ also is equivalent to ${o | o \in D$ and o is density reachable from $p}$ . So $C_{1}$ and $C_{2}$ should be merged into one d-cluster. In summary, a core point only belongs to a d-cluster.

Figure 1 shows an example demonstrating Lemma 10; p is a core point and belongs to both blue and green d-clusters, so the two d-clusters would be merged.

Figure 1

An example of Lemma 10; p is a core point.

Definition 11 (cell).

According to the $E p s$ , the data space is divided into cubes whose length of the sides is $E p s / (2 * \sqrt{2})$ , and each partition of space is called a $C e l l$ . A cell is denoted by $G_{(d_{1}, d_{2})} (d_{1} = 2 * \sqrt{2} * x / E p s$ , $d_{2} = 2 * \sqrt{2} * y / E p s)$ , if the cell's center is $(x, y)$ in 2D space.

Let $| G_{(d_{1}, d_{2})} |$ denote the number of points falling in $G_{(d_{1}, d_{2})}$ . In this paper, we describe the concept of cell in 2D space for generality, though m-dimensional space could equally be applied.

Lemma 12.

Given a point p in the cell $G_{(x, y)}$ , if $\sum_{i = - 1, j = - 1}^{1,1} ‍ | G_{(x + i, y + j)} | \geq M i n P t s$ , then point p is a core point.

Proof.

Lemma 12 can be easily proven using the definition of $C e l l$ . Since the edge length of cell is $E p s / (2 * \sqrt{2})$ , the distance between any point in the cells $G_{(x + i, y + j)} (i = - 1,0, 1; j = - 1,0, 1)$ and p is not greater than $E p s$ . Therefore, if $\sum_{i = - 1, j = - 1}^{1,1} ‍ | G_{(x + i, y + j)} | \geq M i n P t s$ , then $N_{E p s} (p)$ must be $\geq M i n P t s$ ; thus p is a core point.

Thus all points in the cells $G_{(x + i, y + j)} (i = - 1,0, 1; j = - 1,0, 1)$ are directly density reachable from p. These 9 cells are called inner cells with respect to p. By Lemma 12, we can further deduce Corollary 13.

Corollary 13.

Given a cell $G_{(x, y)}$ , if $\sum_{i = - 1, j = - 1}^{1,1} ‍ | G_{(x + i, y + j)} | \geq M i n P t s$ , then $\forall p (p \in G_{(x, y)})$ is a core point.

Lemma 14.

Given a point p in the cell $G_{(x, y)}$ , there exist at most 36 cells in which the points need to be checked when verifying whether p is a core point or not.

Proof.

Lemma 14 is intuitive by Lemma 12 and definition of $E p s$ -neighbor (Definition 2). We directly demonstrate this in Figure 2. Suppose the central cell is $G_{(x, y)}$ . First, for $\forall p (p \in G_{(x, y)})$ , the $E p s$ -range of p must be located within the area according to the side of cell. Second, all of the 36 cells would be covered in the $E p s$ -range of some point in $G_{(x, y)}$ ; in extreme cases, the number 17 to number 36 cells are covered when the p is located at four vertexes. These 36 cells are called outer cells with respect to p.

According to the d-cluster definition (Definition 9) and Lemma 12, we observe that if a core point p in the cell $G_{(x, y)}$ belongs to a d-cluster C, then all points in inner cells with respect to p belong to C. If all points falling in a cell G can be determined to belong to a d-cluster C, we call it an inclusive cell of C. Therefore, the d-cluster C can be represented by all inclusive cells and other border points not in the inclusive cells. Here we give a formalized definition of cell-cluster (c-cluster) by the cell-connected observation.

Figure 2

Demonstration of Lemma 14.

Definition 15 (cell-cluster (c-cluster)).

Given a d-cluster C, a cell-cluster is 2-tuple ${I G s, P s}$ , where $I G s$ is the set of all inclusive cells of C and $P s$ is the set of all border points not in the inclusive cells. The cell-cluster ${I G s, P s}$ is called equivalent cell-cluster with respect to d-cluster C.

c-Cluster simplifies the d-cluster by leveraging inclusive cells instead of large amounts of density-connected points. One no longer checks the status of every point in the clustering; instead one only changes the cluster identifier of the inclusive cells. Therefore, one uses c-cluster instead of d-cluster in the subsequent clustering process.

Definition 16 (exclusive cell).

Given a cell $G_{(x, y)}$ and a c-cluster C, if there exists a core point in both of $G_{(x, y)}$ and C, then $G_{(x, y)}$ is called an exclusive cell of c-cluster C.

Obviously, an exclusive cell must be an inclusive cell of the c-cluster, while an inclusive cell may not be an exclusive cell.

Lemma 17.

Given a cell $G_{(x, y)}$ and a c-cluster C, if $G_{(x, y)}$ is an exclusive cell of C, then the inner cells of $G_{(x, y)}$ must be inclusive cells of c-cluster C.

Proof.

Lemma 17 is intuitive. Since there exists one core point p in $G_{(x, y)}$ that belongs to the d-cluster corresponding to C, then all points in inner cell of p belong to the d-cluster; thus the inner cells are inclusive cells of c-cluster C.

Lemma 18.

Given a cell $G_{(x, y)}$ and a c-cluster C, if $G_{(x, y)}$ is an exclusive cell of C, then any c-clusters containing $G_{(x, y)}$ as an inclusive cell can be merged into C.

Proof.

Lemma 18 is similar to Lemma 10. Since $G_{(x, y)}$ is an exclusive cell of C, there exists at least one core point p in $G_{(x, y)}$ that belongs to the d-cluster corresponding to C. If there exists a c-cluster $C^{'}$ that $G_{(x, y)} \in C^{'}$ . $I G s$ , then p also belongs to the counterpart of $C^{'}$ . Thus $C^{'}$ should be merged into C by Lemma 10.

3.2. Problem Statement

In this work, we focus on the efficient parallel solution of finding density-based clusters from big data on Hadoop platform. Next, we define the distributed clustering problem based on the above definitions, which this paper just aims to resolve.

Problem (distributed clustering with c-cluster). Given a massive-scale dataset D, parameter setting ( $E p s$ and $M i n P t s$ ) and network-connected computers $C P$ deployed Hadoop platform, the distributed clustering is that we output exact c-clusters from D with respect to the given $E p s$ and $M i n P t s$ on the $C P$ .

4. Proposed Serial Clustering with c-Cluster

Our proposed distributed clustering solution is designed based on one optimized serial clustering method that we propose next: clustering with c-cluster—CluC. CluC aims to find out c-clusters from a large-scale dataset. In the following, we present a basic version of CluC omitting details of data structure and additional information for understanding, as shown in Algorithm 1.

Algorithm 1: Clustering with c-cluster (CluC) method.

Input: dataset D, parameters: $E p s$ and $M i n P t s$ .

Output: c-clusters

(1) $/ / D$ is Unclassified

(2) Get SetOfCells from D;

(3) $/ /$ SetOfCells is Unclassified

(4) for each point $p \in D$ do

(5) Get Cell $G \leftarrow p . c o o r d i n a t e$ ;

(6) if $G . c i d$ = Unclassified then

(7) if $p . c i d$ = Unclassified then

(8) Create c-cluster $C \leftarrow C l u s t e r (c i d)$ ;

(9) if EXPANDCLUSTER ( $p, G, D, C, E p s, M i n P t s$ ) then

(10) Create c-cluster $C \leftarrow C l u s t e r (c i d$ ++);

(11) end if

(12) end if

(13) end if

(14) end for

(15) function EXPANDCLUSTER ( $p, G, D, C, E p s, M i n P t s$ )

(16) if $\sum ‍ | i n n e r C e l l s (G) | \geq M i n P t s$ $| |$ p is a core point then

(17) $C . I G s . a d d (G)$ ; $C . E G s . a d d (G)$ ; $G s e e d s . a d d (i n n e r C e l l s (G))$ ;

(18) $G s e e d s . d e l e t e (G)$ ;

(19) while $G s e e d s$ <> Empty do

(20) $G \leftarrow G s e e d s . g e t F i r s t ()$ ;

(21) if $\sum ‍ | i n n e r C e l l s (G) | \geq M i n P t s$ then

(22) if $G . c i d$ <> $U n c l a s s i f i e d$ and $G . c i d$ <> $C . c i d$ then

(23) $C . m e r g e (g e t C l u s t e r (G . c i d)) / /$ Merge two c-clusters;

(24) end if

(25) $C . I G s . a d d (G)$ ; $C . E G s . a d d (G)$ ;

(26) $G s e e d s . a d d (i n n e r C e l l s (G))$ ;

(27) else if $\exists p \in G, p$ is a core point then

(28) if $G . c i d$ <> $U n c l a s s i f i e d$ and $G . c i d$ <> $C . c i d$ then

(29) $C . m e r g e (g e t C l u s t e r (G . c i d)) / /$ Merge two c-clusters;

(30) else if $p . c i d$ <> $U n c l a s s i f i e d$ and $p . c i d$ <> $C . c i d$ then

(31) $C . m e r g e (g e t C l u s t e r (p . c i d)) / /$ Merge two c-clusters;

(32) end if

(33) $C . I G s . a d d (G)$ ; $C . E G s . a d d (G)$ ;

(34) $G s e e d s . a d d (i n n e r C e l l s (G))$ ;

(35) else

(36) $C . I G s . a d d (G)$ ;

(37) end if

(38) $G s e e d s . d e l e t e (G)$ ;

(39) end while

(40) for each cell $G \in o u t e r C e l l s (C . E G s)$ do

(41) if $G . c i d$ <> $C . c i d$ then

(42) for each point $p \in G$ do

(43) if $d i s t (p, C . E G s) \leq E p s$ then

(44) $C . P s . a d d (p)$ ; $p . c i d \leftarrow C . c i d$

(45) end if

(46) end for

(47) end if

(48) end for

(49) return true;

(50) else

(51) $p . c i d \leftarrow N o i s e$ ; return false;

(52) end if

(53) end function

$E p s$ and $M i n P t s$ are density parameters same as in DBSCAN. We can get the set of cells according to the dataset D and $E p s$ . For each point p, if its cell and itself are not yet included in any c-cluster, the function of $E x p a n d C l u s t e r$ would be performed to get the c-cluster including all points density-connected to p. The $i n n e r C e l l s (G)$ stands for the set of 9 inner cells around G, including itself. CluC verifies whether p is a core point by Lemmas 12 and 14 and no longer uses $N_{E p s} (p)$ via the completed neighbor range query.

The $C . I G s$ and $C . E G s$ maintain the set of inclusive cells and exclusive cells of c-cluster C, respectively. The set $G s e e d s$ keeps the unchecked inclusive cells, and $G s e e d s . g e t F i r s t ()$ returns the first cell in the set. CluC first expands the c-cluster by the connected cells by Lemma 17, as shown on lines 19–29, and then further checks the border points in bordering cells as lines 40–48. $C . m e r g e (g e t C l u s t e r (G . c i d))$ merges the previous c-cluster of the cell G into the c-cluster C; this operator is performed by Lemma 18 because the cell G is an exclusive cell of C. The similar reason can be explained for $C . m e r g e (g e t C l u s t e r (p . c i d))$ . However, the point labelled with $N o i s e$ on line 51 may be changed later, if it is density reachable from some points.

The $C . P s$ keeps all border points of c-cluster C not in inclusive cells. The $o u t e r C e l l s (C . E G s)$ denotes set of all outer cells with respect to each cell in $C . E G s$ . The $d i s t (p, C . E G s)$ represents the smallest distance between p and any points in the $C . E G s$ . Namely, for the point p, not in inclusive cells, if there exists any point q in the $C . E G s$ such that $d i s t (p, q) \leq E p s$ , the p is a border point with respect to the c-cluster C.

CluC first expands the inclusive cells recursively to reduce a large amount of distance calculations; then it can fast searches the border points in the pruned space that eliminate the inclusive cells. Therefore, the CluC achieves an efficient and also effective improvement compared with the state-of-the-art serial methods.

5. Cludoop: Distributed Density-Based Clustering on Hadoop

5.1. Cludoop Framework

The overall architecture of our Cludoop framework is depicted in Figure 3. Cludoop uses the existing data partition and does the parallel CluC algorithm in mappers and then does the merging work based on intermediate c-clusters in reducers. The final c-clusters that consist of inclusive cell descriptions and a tiny amount of border points are obtained on one reducer/machine.

Figure 3

Overall architecture of Cludoop clustering.

5.2. Preclustering on Mapper

As shown in Figure 3, Cludoop starts with m mappers reading the data in parallel from the HDFS and employs CluC as a plugged-in clustering on each mapper. We call the phase preclustering to distinguish whole clustering. In this phase, we first build the cell index according to the received dataset, mapping the points into the cells. Then each mapper performs CluC algorithm on the cell-structure data in parallel, aiming to output the c-clusters and noises as preclustering result. However, the preclustering results need to be normalised and shipped to the appropriate reducers over the network, to get final clusters. Thus the network traffic and normalization would cost much CPU resources when shipping all c-clusters and noises to reducers. How can we reduce the network cost for this large-scale dataset?

Our main idea is to only ship the c-clusters with inclusive cells’ descriptions and the border points to reducers, rather than send all points which have already been classified into c-clusters. Thus, we use a simple description ( $t h e i d e n t i f i e r o f c e l l$ — $I D$ , $t h e n u m b e r o f p o i n t s$ —n, $i s a n e x c l u s i v e c e l l ?$ , and $t h e i d e n t i f i e r o f c$ - $c l u s t e r$ — $c i d$ ) to represent an inclusive cell; however the description provides sufficient information with reducers for further merging process. This is because of the following. (1) We build the cells according to the received points and given $E p s$ in each of the mappers, so the cell with the same $I D$ in different mappers stands for the same area; thus we can only use the cell $I D$ to locate the assigned range in overall data space. (2) The points in an inclusive cell must belong to a c-cluster, so the cell $I D$ and c-cluster $c i d$ are enough to stand for classification of all points in the cell. (3) The border points or noise in/nearby an inclusive cell in other mappers need to refine their status in Reduce phase; however the number of points in the inclusive cell is sufficient to determine their status. (4) By Lemma 18, the marked exclusive cell can be used to directly merge the c-clusters sharing the cell in the Reduce phase. Using the inclusive cell's description, only the border points of c-cluster and noise need to be shuffled through the network to other machines. This significantly reduces the amount of data points shipped in the shuffling step, with the consequent savings for the network cost, as well as the I/O cost for the intermediate clustering results. Therefore, the c-cluster in subsequent Reduce phase actually is the pruned c-cluster, namely, only including the inclusive cells’ descriptions not the completed cells.

To efficiently merge the c-clusters located in near area on different mappers, meanwhile to avoid the computation unbalancing for skewed data, we combine the spatial distribution and uniform division in shuffle mechanism. We set $r K e y$ s according to the number of reducers; similar with spatial partitioning, we first normalize the mean coordinate of inclusive cells of each c-cluster to the corresponding nearest $K e y$ value. Thus the c-clusters with same $K e y$ value on different mappers would be sent to the same reducer in the shuffle step. Then we shift the c-clusters from crowded $K e y$ to near unoccupied $K e y$ for balancing the workload in Reduce phase. So our Cludoop achieves a full load balancing throughout whole Map/Reduce phases.

5.3. Merging-Refinement-Merging Framework on Reducer

In Reduce phase, each reducer receives the normalized c-clusters and noises, denoted by ${C s, N s}$ , where $C s$ is the set of c-clusters with cell's descriptions and $N s$ is the set of assigned noises. We propose a Merging-Refinement-Merging 3-step framework based on following cell-based merging and refinement principles on reducer to merge the assigned intermediate c-clusters.

First, we observe that the c-clusters should be merged when they share certain cells or connected cells. To characterize the merging process we formalize two merging rules.

Merging 1.

Given an exclusive cell $G_{(x, y)}$ in a c-cluster C, all c-clusters in $C s$ , which have $G_{(x, y)}$ as an inclusive cell or have any border point falling in $G_{(x, y)}$ , can be merged into C.

Merging 1 is intuitive. First, since $G_{(x, y)}$ is an exclusive cell of C, there must exist at least one core point in $G_{(x, y)}$ . The core point also must be a core point of any c-clusters covering $G_{(x, y)}$ on the overlay of ${C s, N s}$ . The $G_{(x, y)}$ also is an exclusive cell of the c-clusters; thus the c-clusters can be merged by Lemma 18. Second, if one border point p of c-cluster $C^{'}$ falls in $G_{(x, y)}$ on the overlay, then p also belongs to C, and p also would be changed to be a core point by the definition of exclusive cell. Therefore, the C and $C^{'}$ should be merged by Lemma 10.

Merging 2.

Given an exclusive cell $G_{(x, y)}$ in a c-cluster C, all c-clusters in $C s$ , which have any exclusive cell in the $(i n n e r C e l l s (G_{(x, y)}) - G_{(x, y)})$ , can be merged into C.

Merging 2 implies that two c-clusters covering one of two connected exclusive cells, respectively, should be merged into one cluster. This is obvious, because any core points in these two cells are neighbors; thus the density reachable core points would be classified into one cluster. Therefore, they should be merged.

Actually, Merging 1 already contains the same cases of Merging 2. For example, given two connected cells $G_{1}$ and $G_{2}$ , $G_{1}$ is an exclusive cell of c-cluster $C_{1}$ , and $G_{2}$ is an exclusive cell of c-cluster $C_{2}$ . Obviously, $G_{2}$ must be an inclusive cell of $C_{1}$ , and $G_{1}$ also is an inclusive cell of $C_{2}$ . Therefore, the two c-clusters would be merged by Merging 1. However, Merging 2 supplies the processing method for the special case of vacant cells, such as the cell $G_{2}$ which is null in the c-cluster $C_{1}$ from one mapper, and the cell $G_{1}$ is also null in $C_{2}$ from another mapper in the above example. In this case, Merging 1 can not work, while Merging 2 complementally handles the situation.

Next, we analyse the refinement process for the nonexclusive inclusive cells, border points, and noises on the overlay of ${C s, N s}$ in one reducer and propose three refinement rules for these three situations, respectively.

Refinement 1.

For an inclusive cell $G_{(x, y)}$ of a c-cluster C, if $| G_{(x, y)} | \geq M i n P t s$ or $\sum_{i = - 1, j = - 1}^{1,1} ‍ | G_{(x + i, y + j)} | \geq M i n P t s$ on the overlay of ${C s, N s}$ , then $G_{(x, y)}$ can be changed to be an exclusive cell of C.

Refinement 1 can be easily shown. On the overlay of ${C s, N s}$ , some inclusive cells’ exclusivity can be directly determined according to the updated number of points in each cell.

Refinement 2.

Given a c-cluster C, for a border point $p \in C . P s$ , if $| G (p) | \geq M i n P t s$ , $\sum ‍ | i n n e r C e l l s (p) | \geq M i n P t s$ , or $N_{E p s} (p) \geq M i n P t s$ on the overlay of ${C s, N s}$ , then $G (p)$ can be marked to be an exclusive cell of C.

Similarly Refinements 1 and 2 depict the update case for the border points not in inclusive cells on the overlay of ${C s, N s}$ . Note that we can not compute the exact number of neighbors for point p, because we use the cell's description instead of the points in the inclusive cell. In refinement phase, we employ an approximate method to get $N_{E p s} (p)$ when checking if p is a core point. If all vertexes of a cell are in $o u t C e l l s (p)$ whose distance to p is no large than $E p s$ , then the points in the cell are neighbors of p; also if there exist vertexes whose distance to p is larger than $E p s$ , we calculate the approximate rate β of area covered in the $E p s$ - $r a n g e$ of p in the cell and use the $β * n$ to estimate approximately the number of neighbors for p in the cell.

Refinement 3.

Given a noise $p \in N s$ , if $| G (p) | \geq M i n P t s$ , $\sum ‍ | i n n e r C e l l s (p) | \geq M i n P t s$ , or $N_{E p s} (p) \geq M i n P t s$ on the overlay of ${C s, N s}$ , then $G (p)$ can be marked to be an exclusive cell of a new c-cluster.

Refinement 3 indicates the update process for noise on the overlay of assigned preclustering results after first-round merging.

Based on the above merging and refinement rules, we describe the Merging-Refinement-Merging framework to merge the preclustering in Reduce phase. The pseudocode of the framework is shown in Algorithm 2. First, we execute directly the first-round merging using the proposed two merging rules (lines 2–10). This step would merge the overlapping c-clusters and noise in ${C s, N s}$ to fast reduce the number of c-clusters and noise and update the sharing cells’ point number n at the same time. Next, the status of nonexclusive cells, unchecked border points, and noise is further refined on the overlay of ${C s, N s}$ (lines 12–40). As shown in Algorithm 2, Refinement 1 is employed first for the nonexclusive cells. $G^{{C s, N s}}$ denotes the updated cell G on the overlay of ${C s, N s}$ . Then Refinements 2 and 3 are performed in turn for unchecked border points and noise. In process of Refinement 1, most border points and noise candidates falling in the cells also are processed; this greatly reduces the number of points that need to be handled in Refinements 2 and 3. This is also why we first perform Refinement 1. More specifically we also first utilize Lemma 12 to check whether a cell is an exclusive cell or a point is a core point to reduce a large amount of distance calculation for most cases. Finally, second-round merging is performed to reexamine the c-clusters on the updated cells and border points and get the final c-clusters over ${C s, N s}$ .

Algorithm 2: Merging-Refinement-Merging framework on reducer.

Input: assigned ${C s, N s}$ , parameters: $E p s$ and $M i n P t s$ .

Output: c-clusters

(1) //First-round Merging

(2) for each c-cluster $C \in C s$ do

(3) for each cell $G \in C . I G s$ do

(4) if $G . i s E x c l u s i v e = = t r u e$ then

(5) Merge all c-clusters by Merging 1;

(6) //update the number of point in sharing inclusive cells and border points

(7) Merge all c-clusters by Merging 2;

(8) end if

(9) end for

(10) end for

(11) //Refinement

(12) for each c-cluster $C \in C s$ do

(13) for each cell $G \in C . I G s$ do

(14) if $G . i s E x c l u s i v e = = F a l s e$ then

(15) if $| G^{{C s, N s}} | \geq M i n P t s$ $| |$ $\sum ‍ | i n n e r C e l l s (G)^{{C s, N s}} | \geq M i n P t s$ then

(16) $G . i s E x c l u s i v e \leftarrow t r u e$ ; $s e e d s . a d d (G)$ ;

(17) $C . I G s . a d d (i n n e r C e l l s (G))$ ;

(18) end if

(19) end if

(20) end for

(21) for each point $p \in C . P s$ do

(22) if $| c e l l (p)^{{C s, N s}} | \geq M i n P t s$ $| |$ $\sum ‍ | i n n e r C e l l s (p)^{{C s, N s}} | \geq M i n P t s$ $| |$ $N_{E p s} (p) \geq M i n P t s$ then

(23) $c e l l (p) . i s E x c l u s i v e \leftarrow t r u e$ ; $s e e d s . a d d (c e l l (p))$ ;

(24) $C . I G s . a d d (i n n e r C e l l s (p))$ ;

(25) end if

(26) end for

(27) end for

(28) for each noise $p \in N s$ do

(29) get cell $G \leftarrow p . c o o r d i n a t e$ ;

(30) if $| G^{{C s, N s}} | \geq M i n P t s$ $| |$ $\sum ‍ | i n n e r C e l l s (G)^{{C s, N s}} | \geq M i n P t s$ $| |$ $N_{E p s} (p) \geq M i n P t s$

then

(31) $G . i s E x c l u s i v e \leftarrow t r u e$ ; $s e e d s . a d d (G)$ ;

(32) if G is an existing inclusive cell of $C s$ then

(33) $g e t C l u s t e r (G) . I G s . a d d (i n n e r C e l l s (p))$ ;

(34) else

(35) create new c-cluster C; $C . I G s . a d d (i n n e r C e l l s (p))$ ;

(36) end if

(37) else if p falls in an existing inclusive cell on ${C s, N s}$ then

(38) update the cell; delete p from $N s$ ;

(39) end if

(40) end for

(41) //Second-round Merging

(42) for each cell $G \in s e e d s$ do

(43) Merge all c-clusters by Merging 1;

(44) Merge all c-clusters by Merging 2;

(45) end for

In the phase, r reducers execute the Merging-Refinement-Merging process in parallel. However we still need to perform one reducer to return final clusters on one machine in final phase as shown in Figure 3.

6. Experimental Study

6.1. Experimental Setup and Methodologies

A comprehensive performance study has been conducted on 10 network-connected commodity computers that deployed Hadoop platform to evaluate the effectiveness and efficiency of the proposed Cludoop algorithm. Each PC/node is equipped with an Intel Core 2 Duo P8600 processor with 8 GB memory and runs a Ubuntu 11.4 operating system. One node is configured as both NameNode and JobTracker. Other nodes are configured as DataNode and TaskTracker.

Real Datasets. We use two real datasets in our experiments. The first real dataset is a trajectory database from $G e o l i f e$ project [19] executed by Microsoft Research Asia. This dataset includes 178 users’ 164,789 location points from May 7 to May 16, 2007. The second real dataset is $T a x i$ data, a real GPS trajectory data generated by 10,357 taxis in a period from February 2 to February 8, 2008, in Beijing [20]. The total number of points in this dataset is about 15 million, reaching 1.46 GB size.

Synthetic Datasets. We also generate two synthetic datasets by our developed data generator using Matlab. The first synthetic dataset, denoted by $S s y n$ , is generated to test the effectiveness of clustering, which includes multiple clusters of arbitrary shape comprised of 4413 points and 300 noises in $1000 \times 1000$ square.

The second large-scale dataset $L s y n$ is generated to evaluate performance of algorithm on processing big data, which contains about 2 billion data points in 2D space, reaching 53.6 GB size. For generality, we normalize all coordinates into range $[0,1]$ .

Alternative Algorithms. Our experiments focus on evaluating the effectiveness and efficiency of proposed Cludoop. For the effectiveness evaluation, we compare directly clustering results of our algorithm against the most wildly used method DBSCAN. For the efficiency evaluation, we compare the CPU time utilized by our algorithm against the state-of-the-art method MR-DBSCAN as introduced in Section 2.

Metrics and Methodology. We measure the quality of clusters for effectiveness by Precision and Recall as follows: $Precision = (R_{c} \cap D_{c}) / R_{c}$ , $Recall = (R_{c} \cap D_{c}) / D_{c}$ , where $D_{c}$ denotes the set of points labelled cluster ID or clusters obtained by DBSCAN and $R_{c}$ is the clustering result of our algorithm.

We also evaluate the performance of Cludoop method by varying the most important parameters. In particular, we measure the scalability of methods by varying the volume of dataset and the number of work nodes. Moreover, we also measure sensitivity of our Cludoop algorithm on efficiency by varying the input parameter $E p s$ in a large range.

6.2. Effectiveness Evaluation

First, we evaluate the effectiveness of our proposed algorithm compared to DBSCAN using two evaluation metrics (visualization comparison, Precision/Recall) on Ssyn and GeoLife data.

Figure 4 shows the clustering result comparison of two algorithms on Ssyn data when $E p s$ is fixed to 20 and $M i n P t s$ to 4. The black points are classified as noise. We can see that Cludoop can find almost the same clusters as DBSCAN with respect to the same parameter setting. Only a very few of marginal border points are misclassified due to the proximate strategy employed for some border points in Reduce phase. Figure 5 depicts the results of two algorithms on GeoLife data, setting $E p s$ to 100 meters and $M i n P t s$ to 10. Again, our algorithm shows an excellent clustering ability, even better than that on Ssyn data. This is because the locations of GeoLife are on or close to road networks, and only the location crowds at hot regions or crossings are classified as clusters; thus the ratio of border points not in inclusive cells is far below that of arbitrary distributed dataset.

Figure 4

Visualization comparison on Ssyn data.

Figure 5

Visualization comparison on GeoLife data.

The Precision and Recall of Cludoop on Ssyn data are shown in Figure 6. For Ssyn data, the Precision and Recall are computed based on labelled points. From Figure 6(a), the Precision is nearly 100% once the parameter $E p s$ falls in a rather large range. Figure 6(b) shows the Recall of our algorithm is also good in certain parameters space. For the real GeoLife data, we compute the Precision and Recall of Cludoop algorithm based on the clustering result of DBSCAN with respect to the same parameters, excluding the noise. From Figure 7, the Precision and Recall of clusters of Cludoop achieve on average 97% and 96% compared to DBSCAN in all test cases, respectively. In particular, Cludoop shows nearly 100% Precision when $E p s \leq 100$ and $M i n P t s$ varies from 5 to 20.

Figure 6

Effectiveness evaluation on Ssyn data.

Figure 7

Effectiveness evaluation on GeoLife data.

In summary the effectiveness study confirms that our distributed algorithm can obtain the correct clusters with noise under a loose parameter setting.

6.3. Efficiency Evaluation

Next we evaluate the efficiency of our algorithm compared to MR-DBSCAN algorithm using the Taxi and Lsyn data. We vary the most important parameters, to (1) assess the scalability of Cludoop versus MR-DBSCAN in terms of efficiency and (2) evaluate sensitivity of parameter variation on our method.

6.3.1. Varying Volume of Dataset D

We first evaluate the scalability of our algorithm in terms of the volume of dataset using Lsyn data. In this experiment we randomly extract four subsets from the Lsyn data from 20% to 80%. We fix the $E p s$ to 0.001 and $M i n P t s$ to 200. Figure 8 shows the total running time of our algorithm and MR-DBSCAN on the five datasets. Cludoop algorithm exhibits much better scalability than MR-DBSCAN in terms of CPU time. In particular, Cludoop saves on average 42% time compared to MR-DBSCAN on all tested cases. As the volume of dataset increases, both algorithms require more time to process these increased data points. However, our Cludoop algorithm saves more time with the rising of the number of data points. This is because more cells could be determined directly to be inclusive cells to reduce more distance calculations as the volume of D increases at fixed $E p s$ and $M i n P t s$ .

Figure 8

Varying volume of dataset D on Lsyn data.

6.3.2. Varying Number of Nodes $n d$

Next, we valuate the speedup of our algorithm as the number of nodes increases by varying work node from 2 to 9 on the Taxi data when $E p s$ is fixed to 100 meters and $M i n P t s$ to 4. As shown in Figure 9, Cludoop algorithm also clearly outperforms MR-DBSCAN in terms of running time in all tested cases. In particular, the speedup of our algorithm achieves at 6 times while that of MR-DBSCAN just achieves at 4 times when the number of nodes increases from 2 to 9. However, the speedup of our Cludoop also decreases as $n d$ increases similar to MR-DBSCAN. This may be because more border points or noise would be sent to reducers in shuffle phase when launching more mappers, spending more network cost, and merging time on reducers, although the increased parallel mappers reduce the preclustering time.

Figure 9

Varying number of nodes $n d$ on Taxi data.

6.3.3. Varying Neighbor Range Threshold $E p s$

Finally, we analyze the sensitivity of our algorithm with respect to the important parameter $E p s$ on both 10% * Lsyn and Taxi datasets. We vary $E p s$ from 0.0005 to 0.003 on the 10% * Lsyn data when $M i n P t s$ is fixed to 200 and vary $E p s$ from 100 to 300 meters on Taxi data at fixed $M i n P t s = 4$ . From Figures 10 and 11, our proposed Cludoop shows outstanding superiority on time consumption compared to MR-DBSCAN with respect to $E p s$ . This is because (1) our algorithm reduces the cost of maintaining cell index as $E p s$ increases and (2) more cells could be directly determined to be inclusive cells at a larger $E p s$ value when $M i n P t s$ is fixed, saving much network cost in shuffle and merging time in reducer. Whereas larger $E p s$ value leads to more duplicate points, costing more time in Shuffle and Reduce phase for MR-DBSCAN, larger $E p s$ is more likely to lead to imbalance workload in reducer due to its data partition being based on $E p s$ . However, very large $E p s$ value actually is meaningless for density-based clustering when $M i n P t s$ is fixed, since it can not return the meaningful clusters.

Figure 10

Varying neighbor range $E p s$ on 10% ∗ Lsyn data.

Figure 11

Varying neighbor range $E p s$ on Taxi data.

7. Conclusion

Density-based clustering for increasing big data applications is very important yet difficult task. This paper proposes an efficiency and load-balanced distributed density-based clustering for big data using existing data partition on Hadoop platform. Our algorithm incorporates a proposed serial clustering with c-cluster as plug-in on mapper and a Merging-Refinement-Merging 3-step framework to fast merge c-cluster on reducer. The empirical study on large-scale real and synthetic data shows that our Cludoop algorithm effectively finds the correct clusters and exhibits better scalability and efficiency than the state-of-the-art. An interesting direction for future work is to leverage optimized shuffle mechanism for further improving the workload balancing problem of clustering on Hadoop platform.

Footnotes

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (nos. 61403328, 61403329, 61302065, and 61172049), the open project program of Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (no. 93K172014K13), the Shandong Provincial Natural Science Foundation (no. ZR2013FM011), and the project of Shandong Province higher educational science and technology program (no. J14LN24).

References

Ester

Kriegel

H.-P.

Sander

A density-based algorithm for discovering clusters in large spatial databases with noise

Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining(SIGKDD ′96)

1996

Portland, Ore, USA

226 231

Ankerst

Breunig

Kriegel

H.-P.

Sander

OPTICS: ordering points to identify the clustering structure

Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ′99)

1999

Philadelphia, Pa, USA

Hinneburg

Keim

D. A.

An efficient approach to clustering in large multimedia databases with noise

Proceedings of the 4th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD ′98)

September 1998

New York, NY, USA

58 65

Sander

Ester

Kriegel

H.-P.

Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications

Data Mining and Knowledge Discovery 1998 2 2 169 194

10.1023/a:1009745219419

2-s2.0-22044455069

Shim

MapReduce algorithm for big data analysis

Proceedings of the International Conference on Very Large Data Bases (VLDB ′12)

2012

Istanbul, Turkey

2016 2017

Jäger

Kriegel

H. P.

A fast parallel clustering algorithm for large spatial databases

Data Mining and Knowledge Discovery 1999 3 3 263 290

2-s2.0-22844454592

10.1023/a:1009884809343

Januzaj

Kriegel

H.-P.

Pfeifle

DBDC: density based distributed clustering

Proceedings of the International Conference on Extending Database Technology (EDBT ′04)

March 2004

Heraklion, Greece

88 105

Januzaj

Kriegel

H.-P.

Pfeifle

Scalable density-based distributed clustering

Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD '04)

2004

Heraklion, Greece

88 105

Dash

Petrutiu

Scheuermann

Ecient parallel hierarchical clustering

Euro-Par 2004 Parallel Processing 2004 3149

Berlin, Germany

Springer

363 371 Lecture Notes in Computer Science

10.1007/978-3-540-27866-5_47

10.

Brecheisen

Kriegel

H.-P.

Pfeifle

Parallel density-based clustering of complex objects

Advances in Knowledge Discovery and Data Mining: 10th Pacific-Asia Conference, PAKDD 2006, Singapore, April 9-12, 2006. Proceedings 2006 3918

Berlin, Germany

Springer

179 188 Lecture Notes in Computer Science

10.1007/11731139_22

11.

Böhm

Noll

Plant

Wackersreuther

Zherdin

Data mining using graphics processing units

Transactions on Large-Scale Data- and Knowledge-Centered Systems I 2009 5740

Berlin, Germany

Springer

63 90 Lecture Notes in Computer Science

12.

Böhm

Noll

Plant

Wackersreuther

Density-based clustering using graphics processors

Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ′09)

November 2009

Hong Kong

661 670

10.1145/1645953.1646038

2-s2.0-74549212607

13.

Andrade

Ramos

Madeira

Sachetto

Ferreira

Roch

G-DBSCAN: a GPU accelerated algorithm for density-based clustering

Proceedings of the International Conference on Computational Science (ICCS '13)

2013

Barcelona, Spain

369 378

14.

Ene

Moseley

Fast clustering using MapReduce

Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ′11)

August 2011

San Diego, Calif, USA

681 689

10.1145/2020408.2020515

2-s2.0-80052666000

15.

Tan

Luo

Mao

Feng

Fan

MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce

Proceedings of the IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS ′11)

December 2011

Tainan, Taiwan

473 480

10.1109/icpads.2011.83

2-s2.0-84863032158

16.

Dai

B.-R.

Lin

I.-C.

Efficient map/reduce-based DBSCAN algorithm with optimized data partition

Proceedings of the IEEE 5th International Conference on Cloud Computing (CLOUD ′12)

June 2012

Honolulu, Hawaii, USA

IEEE

59 66

10.1109/cloud.2012.42

2-s2.0-84866768237

17.

Tan

Luo

Feng

Fan

MR-DBSCAN: a scalable MapREDuce-based DBSCAN algorithm for heavily skewed data

Frontiers in Computer Science 2014 8 1 83 99

10.1007/s11704-013-3158-3

MR3151591

2-s2.0-84892438289

18.

Cordeiro

R. L. F.

Traina

Jr. Traina

A. J. M.

López

Kang

Faloutsos

Clustering very large multi-dimensional datasets with MapReduce

Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ′11)

August 2011

San Diego, Calif, USA

690 698

10.1145/2020408.2020516

2-s2.0-80052686089

19.

Xing

W.-Y.

Geolife: a collaborative social networking service among user, location and trajectory

IEEE Data Engineering Bulletin 2010 33 2 32 40

20.

Yuan

Zheng

Xie

Sun

T-drive: enhancing driving directions with taxi drivers' intelligence

IEEE Transactions on Knowledge and Data Engineering 2013 25 1 220 232

10.1109/tkde.2011.200

2-s2.0-84870471446

Cludoop: An Efficient Distributed Density-Based Clustering for Big Data Using Hadoop

Abstract

1. Introduction

2. Related Work

3. Preliminaries and Problem Statement

3.1. Definitions and Theorems

Definition 1 ( E p s -range).

Definition 2 ( E p s -neighbor).

Definition 3 (core point).

Definition 4 (border point).

Definition 5 (isolated point).

Definition 6 (directly density reachable).

Definition 7 (density reachable).

Definition 8 (density connected).

Definition 9 (density-based cluster (d-cluster)).

Lemma 10.

Proof.

Definition 11 (cell).

Lemma 12.

Proof.

Corollary 13.

Lemma 14.

Proof.

Definition 15 (cell-cluster (c-cluster)).

Definition 16 (exclusive cell).

Lemma 17.

Proof.

Lemma 18.

Proof.

3.2. Problem Statement

4. Proposed Serial Clustering with c-Cluster

Algorithm 1: Clustering with c-cluster (CluC) method.

5. Cludoop: Distributed Density-Based Clustering on Hadoop

5.1. Cludoop Framework

5.2. Preclustering on Mapper

5.3. Merging-Refinement-Merging Framework on Reducer

Merging 1.

Merging 2.

Refinement 1.

Refinement 2.

Refinement 3.

Algorithm 2: Merging-Refinement-Merging framework on reducer.

6. Experimental Study

6.1. Experimental Setup and Methodologies

6.2. Effectiveness Evaluation

6.3. Efficiency Evaluation

6.3.1. Varying Volume of Dataset D

6.3.2. Varying Number of Nodes n d

6.3.3. Varying Neighbor Range Threshold E p s

7. Conclusion

Footnotes

Conflict of Interests

Acknowledgments

References

Definition 1 ( $E p s$ -range).

Definition 2 ( $E p s$ -neighbor).

6.3.2. Varying Number of Nodes $n d$

6.3.3. Varying Neighbor Range Threshold $E p s$