An arbitrary shape clustering algorithm over variable density data streams

Abstract

This paper proposes VDStream, a new effective method, to discover arbitrary shape clusters over variable density data streams. The algorithm can reduce the influence of history data and effectively eliminate the interference of noise data. When the density of data streams changes, VDStream can dynamically adjust the parameters of density to find precise clusters. Experiments demonstrate the effectiveness and efficiency of VDStream.

Keywords

Data streams, clustering, variable density

Introduction

In recent years, with the development of network and information technology, data streams have emerged as a new data model in many application fields, such as web click streams, traffic monitoring and management, sensor networks and intrusion detection. A data stream is produced continuously in time order and changes rapidly, with varying update rates. An enormous volume of streaming data is generated in sequence without bound.

Owing to the rapid growth of data stream applications, many methods for clustering data streams have been proposed.¹–¹⁰ Given the massive volume and high-speed change of data streams, data stream clustering should meet the following requirements¹¹: (1) compact representation, (2) fast detection of outliers and (3) fast, incremental processing of new data.

Related work

Many methods have been proposed to discover clusters in data streams effectively.

The LocalSearch algorithm was proposed by Guha et al.⁶ based on a divide-and-conquer strategy. The STREAM algorithm uses the sum of squared distances (SSQ) to evaluate clustering quality.⁷ Both algorithms can cluster all history data effectively in limited time and space. However, they ignore the influence of old history data in evolving data streams.

DBSCAN can discover clusters of arbitrary shape and of different sizes. However, it is sensitive to its parameters because it requires the user to define the radius ɛ and the minimum number of points (MinPts). Moreover, it does not consider memory limitations.⁸

Aggarwal et al.¹ proposed the CluStream algorithm to cluster evolving data streams based on historical and current data. The clustering process adopts a two-part framework consisting of an online micro-clustering part and an offline macro-clustering part. Summary information about the data stream is calculated and stored in micro-clusters, which are maintained incrementally based on the pyramidal time frame model. Macro-clusters are generated offline on user request. CluStream generates spherical clusters effectively, but handles clusters of arbitrary shape poorly. Furthermore, CluStream requires the number of clusters to be predefined. A newly arriving data point is absorbed by an existing cluster if it belongs to that cluster; otherwise, a new cluster centered on the new data point is created. If memory is limited and the maximum number of clusters has been reached, the least recently used cluster must be removed or two existing clusters must be merged. If the point in the new cluster is an outlier, the clustering precision is reduced.

DenStream was proposed for clustering evolving data streams.² It generates clusters of arbitrary shape based on density and is insensitive to noise. Although it does not need a predefined number of clusters, it requires the user to define ɛ and MinPts. Furthermore, the quality of its clusters is poor over variable density or low-density data streams.

Cluster A and cluster B are two clusters hidden in noise, as shown in Figure 1. The density of the noise around cluster A is equal to the density of cluster B. If the density threshold is high enough, cluster A can be found and the points around cluster A are regarded as noise, but cluster B is missed. If the density threshold is low enough, cluster B can be found. However, cluster A and the noise points around it are then regarded as a single cluster. Clearly, a density-based clustering method with static parameters cannot produce ideal results on variable density data streams.

Figure 1.

Clusters in variable density dataset.

This paper uses the shared nearest neighbor (SNN) method to define the density parameters dynamically.¹² As shown in Figure 2, point A and point B are in each other's 8-nearest-neighbor lists, and they share four neighbors. The similarity of point A and point B is therefore 4. Likewise, the similarity of point B and point C is 5. Because the similarity between most pairs of points is 0, a sparse graph is used to represent the similarities.
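The SNN similarity described here can be illustrated with a minimal Python sketch. The function names `knn` and `snn_similarity` are hypothetical, the brute-force neighbor search is chosen only for brevity, and storing only non-zero similarities is an assumption made to keep the graph sparse.

```python
import math

def knn(points, k):
    """Return, for each point index, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors

def snn_similarity(points, k):
    """Sparse SNN graph as {(i, j): shared-neighbor count} with i < j."""
    nn = knn(points, k)
    sim = {}
    for i in range(len(points)):
        for j in nn[i]:
            # Similarity is non-zero only for points in each other's k-NN lists.
            if i < j and i in nn[j]:
                shared = len(nn[i] & nn[j])
                if shared > 0:  # assumption: zero entries are not stored
                    sim[(i, j)] = shared
    return sim
```

Pairs that are absent from the returned dictionary have similarity 0, which is what makes the representation sparse.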

Figure 2.

SNN similarity.

This paper proposes a novel method named VDStream. Like CluStream and DenStream, it adopts a two-part clustering strategy consisting of an online micro-clustering part and an offline macro-clustering part. We use a density-based clustering method to generate micro-clusters online and adopt a pruning strategy to reduce memory requirements. When the density of the data stream changes, the SNN similarity is calculated to redefine the density parameters.

Fundamental concepts

Definition 1 (atom point). A point p is defined as an atom point if the number of points in the ɛ-neighborhood of p is greater than or equal to MinPts.

Definition 2 (density area). The ɛ-neighborhood of an atom point p is called the density area of p.

In static data processing, every data point can be stored. Over data streams, however, the large volume of data and limited memory make it impossible to find exact clusters, so we aim to find approximate clusters.

With data flowing in without bound, users are more interested in new data than in old data. As time passes, the influence of the data points in a cluster becomes weaker. Suppose that the data point $p_{i_j}$ arrives at time $T_{i_j}$; then a fading function $f(Δt_{i_j}) = 2^{-λ Δt_{i_j}}$ gives its fading degree at the current time t, where $λ > 0$ and $Δt_{i_j} = t - T_{i_j}$. The higher the value of λ, the smaller the influence of historical data. For a given λ, the higher the value of $Δt_{i_j}$, the smaller the influence of the point on the cluster. If a data point in a cluster becomes weak, the center of the cluster moves away from it. The center of the cluster is defined as $c = \sum_{j=1}^{n} f(Δt_{i_j}) p_{i_j} / \sum_{j=1}^{n} f(Δt_{i_j})$.
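As a small illustration of the fading function and the faded cluster center, the following Python sketch may help. The function names `fading` and `faded_center` are hypothetical, and points are assumed to be tuples of floats.

```python
def fading(dt, lam):
    """Fading degree f(dt) = 2^(-lam * dt) of a point that arrived dt time units ago."""
    return 2.0 ** (-lam * dt)

def faded_center(points, stamps, t, lam):
    """Center of a cluster at time t, with each point weighted by its fading degree."""
    weights = [fading(t - ts, lam) for ts in stamps]
    total = sum(weights)
    dim = len(points[0])
    return tuple(
        sum(w * p[d] for w, p in zip(weights, points)) / total
        for d in range(dim)
    )
```

With λ = 0.2, a point that arrived 5 time units ago has half the weight of a point arriving now, so the center sits closer to the newer point.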

Definition 3 (Atom micro-cluster). At time t, for a group of data points $p_{i_1}, p_{i_2}, \dots, p_{i_n}$ with time stamps $T_{i_1}, T_{i_2}, \dots, T_{i_n}$, an atom micro-cluster (a-micro-cluster) is defined as CFA = $(CF_2, CF_1, ɛ, ρ, t_a, t_{check})$. At time t, the value of a fading point $p_{i_j}$ with time stamp $T_{i_j}$ is $f(Δt_{i_j}) p_{i_j}$. $CF_1 = \sum_{j=1}^{n} f(Δt_{i_j}) p_{i_j}$ is the linear sum of the fading points. $CF_2 = \sum_{j=1}^{n} f(Δt_{i_j}) p_{i_j}^{2}$ is the squared sum of the fading points. $ρ = \sum_{j=1}^{n} f(Δt_{i_j})$ is the weight at time $t_a$, with $ρ \geq MinPts$. $t_a$ denotes the last update time of the a-micro-cluster. $t_{check}$ denotes the time at which the a-micro-cluster should be checked for decay into a c-micro-cluster.

The center of an a-micro-cluster is $c = \frac{CF_1}{ρ}$, and its radius is $r = \sqrt{\frac{CF_2}{ρ} - \left(\frac{CF_1}{ρ}\right)^{2}}$, with $r \leq ɛ$. The value of ɛ is determined dynamically by the SNN similarity calculation, so it differs between data streams of different density.

Definition 4 (Candidate micro-cluster). At time t, for a group of data points $p_{i_1}, p_{i_2}, \dots, p_{i_n}$ with time stamps $T_{i_1}, T_{i_2}, \dots, T_{i_n}$, a candidate micro-cluster (c-micro-cluster) is defined as CFC = $(CF_2, CF_1, ɛ, ρ, t_c)$, with $ρ < MinPts$.

For two non-overlapping c-micro-clusters C₁ and C₂ with clustering features CFC₁ and CFC₂, respectively, the clustering feature of the merged cluster is CFC₁ + CFC₂. If point p is absorbed by C₁ at time t₁, then CFC₁ = $(CF_2 + p^{2}, CF_1 + p, ɛ, ρ + 1, t_1)$. Because clustering features can be added and subtracted, micro-clusters are easy to maintain and need little memory.
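The additivity of clustering features can be sketched in Python as follows. The class name `ClusterFeature` and its methods are hypothetical. The sketch assumes that $p^{2}$ means the squared Euclidean norm of a point, so CF2 is a scalar, and it follows the unfaded absorb rule exactly as stated in this paragraph.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusterFeature:
    cf2: float   # sum of squared norms of the points (assumption: scalar)
    cf1: tuple   # per-dimension linear sum of the points
    eps: float   # neighborhood radius shared by the CF tree
    rho: float   # weight
    t: float     # last update time

    @classmethod
    def from_point(cls, p, eps, t):
        return cls(sum(x * x for x in p), tuple(p), eps, 1.0, t)

    def center(self):
        return tuple(x / self.rho for x in self.cf1)

    def radius(self):
        c = self.center()
        variance = self.cf2 / self.rho - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))  # clamp rounding noise

    def absorb(self, p, t):
        """(CF2 + p^2, CF1 + p, eps, rho + 1, t), as in the text."""
        return ClusterFeature(
            self.cf2 + sum(x * x for x in p),
            tuple(a + x for a, x in zip(self.cf1, p)),
            self.eps, self.rho + 1.0, t)

    def __add__(self, other):
        """Merge two non-overlapping micro-clusters by adding their features."""
        if self.eps != other.eps:
            raise ValueError("micro-clusters with different eps are not merged")
        return ClusterFeature(
            self.cf2 + other.cf2,
            tuple(a + b for a, b in zip(self.cf1, other.cf1)),
            self.eps, self.rho + other.rho, max(self.t, other.t))
```

Only a fixed-size summary is kept per micro-cluster, so center and radius are available without storing the individual points.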

VDStream

The online micro-clustering part is limited by space and time. A newly arrived point p can be absorbed by an existing micro-cluster c if p belongs to c. If p does not belong to any micro-cluster, it cannot be discarded, because p may be either noise or the first point of a new a-micro-cluster. Clustering features are stored on the hard disk at regular time intervals. Meaningful clusters can be generated offline on user request.

Maintenance of micro-clusters

Initially, for the data points that arrive in the time interval [0, T], SNN similarities are calculated and the connected branches of the similarity graph are found. The detailed procedure is described in Algorithm 1. For the points in a connected branch, an a-micro-cluster is generated if the number of points in the branch is greater than or equal to MinPts. The value of ɛ is set to the current average radius of the a-micro-clusters. The other branches and each isolated point form c-micro-clusters. The SNN graph is sparse: the space complexity is O(km) and the time complexity is O(m log m), where m is the number of points.

Algorithm 1. SNN(dc,k,MinPts)

Find k-nearest-neighbor for all points in data stream dc;

if point p and point q are not in k-nearest-neighbor of each other then

Sim(p,q) = 0;

else

Sim(p,q) = the number of k-nearest-neighbor points shared by p and q;

end if

Construct the similarity graph;

for each connected branch do

if the number of points ≥ $MinPts$ then

Generate a-micro-cluster;

Compute the radius r of a-micro-cluster;

end if

end for

$ɛ = AVERAGE (r) .$
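The last steps of Algorithm 1 can be sketched in Python as follows, taking the edges of an already computed SNN graph as input. The function names are hypothetical, the union-find structure is one possible way to find connected branches, and the radius is computed without fading because all initial points are treated as equally recent, which is an assumption.

```python
import math

def connected_branches(n, edges):
    """Group point indices 0..n-1 into connected branches of the graph."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def rms_radius(members):
    """Root-mean-square distance of the points to their centroid."""
    dim = len(members[0])
    c = tuple(sum(p[d] for p in members) / len(members) for d in range(dim))
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in members) / len(members))

def initial_eps(points, edges, min_pts):
    """Average radius of the branches large enough to form a-micro-clusters."""
    radii = [
        rms_radius([points[i] for i in branch])
        for branch in connected_branches(len(points), edges)
        if len(branch) >= min_pts
    ]
    return sum(radii) / len(radii) if radii else None
```

Branches smaller than MinPts do not contribute to ɛ; in the procedure above they become c-micro-clusters instead.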

Algorithm 2. Disposepoint (p)

Find the a-micro-cluster C_a nearest to p;

if r (C_a + p) ≤ ɛ then

Insert p into C_a;

Update C_a;

else

Find the c-micro-cluster C_c nearest to p;

if r (C_c + p) > ɛ then

Build a new c-micro-cluster and absorb p;

else

Insert p into C_c;

Update C_c;

if $ρ$ ≥ MinPts then

move C_c from CFC to CFA;

end if

end if

end if.

After the initialization, each new data point q that arrives at time t is processed as follows. The procedure is shown in Algorithm 2.

First, find the a-micro-cluster C_a nearest to q. Suppose that the clustering feature of C_a is CFA = $(CF_2, CF_1, ɛ, ρ, t_a, t_{check})$, where $t_{check} = t_a - (\log \frac{MinPts}{ρ}) / λ$. Try to absorb point q into C_a. The clustering feature of C_a then becomes CFA = $(CF_2 + q^{2}, CF_1 + q, ɛ, ρ \times f(t - t_a) + 1, t, t_{check})$, where $t_{check} = t - (\log \frac{MinPts}{ρ \times f(t - t_a)}) / λ$. Point q is actually absorbed by C_a if and only if the new radius of C_a is lower than or equal to ɛ.

If the new radius of C_a is higher than ɛ, we try to add q to the nearest c-micro-cluster C_c, whose clustering feature is CFC = $(CF_2, CF_1, ɛ, ρ, t_c)$. The clustering feature of C_c then becomes CFC = $(CF_2 + q^{2}, CF_1 + q, ɛ, ρ \times f(t - t_c) + 1, t)$. If the new radius of C_c is higher than ɛ, we create a new c-micro-cluster to absorb q. A CFC is updated only when a new point is inserted into it.

If the new radius of C_c is lower than or equal to ɛ, we insert q into C_c. The new weight of C_c is $ρ = ρ \times f(t - t_c) + 1$. We consider that the c-micro-cluster C_c has grown into an a-micro-cluster if the new weight $ρ$ is higher than or equal to MinPts. In that case, we move C_c from the CFC tree to the CFA tree.
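The point-handling steps above can be sketched in Python as follows. The names `MicroCluster`, `trial` and `dispose_point` are hypothetical, the CF trees are replaced by plain lists, and the maintenance of the check time is omitted. The sketch also fades CF1 and CF2 together with ρ before adding the point, which is an assumption: the update formulas in the text show the fading factor only on ρ.

```python
import math

def fade(dt, lam):
    return 2.0 ** (-lam * dt)

class MicroCluster:
    def __init__(self, p, t):
        self.cf1 = [float(x) for x in p]
        self.cf2 = float(sum(x * x for x in p))
        self.rho = 1.0
        self.t = t

    def center(self):
        return [a / self.rho for a in self.cf1]

    def trial(self, p, t, lam):
        """Features as they would be after fading to time t and adding p."""
        f = fade(t - self.t, lam)
        cf1 = [f * a + x for a, x in zip(self.cf1, p)]
        cf2 = f * self.cf2 + sum(x * x for x in p)
        return cf1, cf2, f * self.rho + 1.0

def radius(cf1, cf2, rho):
    variance = cf2 / rho - sum((a / rho) ** 2 for a in cf1)
    return math.sqrt(max(variance, 0.0))

def dispose_point(p, t, cfa, cfc, eps, min_pts, lam):
    """Try the nearest a-micro-cluster, then the nearest c-micro-cluster."""
    for pool in (cfa, cfc):
        c = min(pool, key=lambda m: math.dist(m.center(), p), default=None)
        if c is None:
            continue
        cf1, cf2, rho = c.trial(p, t, lam)
        if radius(cf1, cf2, rho) <= eps:
            c.cf1, c.cf2, c.rho, c.t = cf1, cf2, rho, t
            if pool is cfc and rho >= min_pts:
                cfc.remove(c)   # the c-micro-cluster has grown
                cfa.append(c)   # into an a-micro-cluster
            return c
    new = MicroCluster(p, t)    # p is noise or the seed of a new cluster
    cfc.append(new)
    return new
```

The trial update is computed first and committed only if the radius test passes, which mirrors the "try to absorb" wording.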

The state of a-micro-clusters and c-micro-clusters may be changed as follows.

For an a-micro-cluster C_a, the weight $ρ$ decreases gradually if no new points are absorbed by C_a over a period of time. When the weight $ρ$ falls below MinPts, C_a is moved from the CFA tree to the CFC tree.

In order to remove fading a-micro-clusters from the CFA tree in time, the weight of all a-micro-clusters needs to be checked at regular time intervals. Suppose that the number of points in an a-micro-cluster is n and the checking frequency per unit time is m. Then the time complexity of the check is O(mn). Fortunately, the actual check time of each a-micro-cluster can be computed in advance, which reduces the time complexity. Suppose that at time $t_a$, the weight of the a-micro-cluster C_a is $ρ_1 = \sum_{j=1}^{n} f(t_a - t_j)$, where the point $p_j$ arrives at time $t_j$, $1 \leq j \leq n$. If C_a does not absorb any new point in the period $[t_a, t_c]$, then at time $t_c$ the weight of C_a is

$ρ_2 = \sum_{j=1}^{n} f(t_a - t_j) f(Δt_{ca}) = ρ_1 \times f(Δt_{ca})$

where

$f(Δt_{ca}) = 2^{-λ Δt_{ca}}$

$Δt_{ca} = t_c - t_a$

If $ρ_2$ is lower than MinPts, C_a fades into a c-micro-cluster, that is, $ρ_1 \times 2^{-λ Δt_{ca}} < MinPts$. This holds if $Δt_{ca} > -(\log \frac{MinPts}{ρ_1}) / λ$, that is, once the current time satisfies $t \geq t_a - (\log \frac{MinPts}{ρ_1}) / λ$, where the logarithm is taken to base 2 to match the fading function. We therefore set $t_{check} = t_a - (\log \frac{MinPts}{ρ_1}) / λ$. Every a-micro-cluster that meets the condition $t_{check} \leq t$ is moved from the CFA tree to the CFC tree. The time complexity of the check is only O(m), which is much lower than O(mn).
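A short numeric check of this derivation, written as a Python sketch with hypothetical function names, shows that the weight decays exactly to MinPts at the computed check time.

```python
import math

def check_time(t_a, rho_1, min_pts, lam):
    """t_check = t_a - log2(MinPts / rho_1) / lam."""
    return t_a - math.log2(min_pts / rho_1) / lam

def weight_at(t, t_a, rho_1, lam):
    """Weight of an a-micro-cluster that absorbed nothing since t_a."""
    return rho_1 * 2.0 ** (-lam * (t - t_a))
```

For example, with $t_a = 100$, $ρ_1 = 40$, MinPts = 10 and λ = 0.2, the logarithm is $\log_2 0.25 = -2$, so $t_{check} = 100 + 2 / 0.2 = 110$, and the weight at that time is $40 \times 2^{-2} = 10$.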

Two close c-micro-clusters may grow into an a-micro-cluster when they are merged. If the two c-micro-clusters have different values of ɛ, they cannot be merged because their densities differ.

Suppose that c-micro-clusters CFC₁ and CFC₂ are merged, so that the new cluster is CFC = CFC₁ + CFC₂. If the new weight ρ of CFC is higher than or equal to MinPts and the new radius is lower than or equal to ɛ, CFC meets the conditions of an a-micro-cluster. We then insert the new micro-cluster CFC into the CFA tree and remove CFC₁ and CFC₂ from the CFC tree.

To reduce the processing time, we sort all c-micro-clusters by ρ; the higher ρ is, the higher the processing priority. When ρ of the c-micro-cluster currently being processed is lower than $\frac{MinPts}{2}$, the process ends, because merging two c-micro-clusters whose weights are both lower than $\frac{MinPts}{2}$ always yields a weight lower than MinPts. The process is shown in Algorithm 3.

Algorithm 3. Merge( )

sort c-micro-clusters according to ρ;

do{

Find the unprocessed c-micro-cluster C_c1 with the biggest ρ;

//the weight of C_c1 is ρ₁

Find its nearest neighbor C_c2 whose weight ρ₂ ≥ MinPts − ρ₁;

Try to merge C_c1 and C_c2;

if new r ≤ ɛ then

insert the new micro-cluster into CFA;

remove C_c1 and C_c2 from CFC;

end if

}while (ρ₁ ≥ $\frac{MinPts}{2}$ );
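One way to sketch the merging pass in Python is shown below. Micro-clusters are represented as plain dictionaries with the hypothetical keys `cf1`, `cf2`, `rho` and `eps`, the CF trees are plain lists, and the brute-force nearest-neighbor search is an assumption made for brevity.

```python
import math

def center(c):
    return [x / c["rho"] for x in c["cf1"]]

def radius(c):
    variance = c["cf2"] / c["rho"] - sum(x * x for x in center(c))
    return math.sqrt(max(variance, 0.0))

def combine(a, b):
    """Clustering features are additive, so merging is a field-wise sum."""
    return {"cf1": [x + y for x, y in zip(a["cf1"], b["cf1"])],
            "cf2": a["cf2"] + b["cf2"],
            "rho": a["rho"] + b["rho"],
            "eps": a["eps"]}

def merge_pass(cfc, cfa, min_pts):
    """Promote pairs of c-micro-clusters that together qualify as a-micro-clusters."""
    pending = sorted(cfc, key=lambda c: c["rho"], reverse=True)
    # Below MinPts / 2, no partner of lower weight can reach MinPts.
    while pending and pending[0]["rho"] >= min_pts / 2:
        c1 = pending.pop(0)
        partners = [c for c in cfc
                    if c is not c1 and c["eps"] == c1["eps"]
                    and c["rho"] >= min_pts - c1["rho"]]
        c2 = min(partners,
                 key=lambda c: math.dist(center(c), center(c1)),
                 default=None)
        if c2 is None:
            continue
        merged = combine(c1, c2)
        if radius(merged) <= merged["eps"]:
            cfa.append(merged)
            cfc[:] = [c for c in cfc if c is not c1 and c is not c2]
            pending = [c for c in pending if c is not c2]
```

Restricting partners to weights of at least MinPts − ρ₁ guarantees that any successful merge already satisfies the weight condition, so only the radius needs to be tested.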

Due to the limit of space, c-micro-clusters must be deleted periodically.

Old c-micro-clusters and c-micro-clusters of low compactness should be deleted. For a c-micro-cluster CFC = $(CF_2, CF_1, ɛ, ρ, t_c)$, the larger $Δt = t - t_c$ is, the older the c-micro-cluster, where t is the current time. The compactness of a c-micro-cluster is determined by its weight ρ and radius r. However, the radius r of a micro-cluster in a low-density data stream may be larger than that in a high-density data stream, so r is judged relative to ɛ. We adopt $δ = \frac{ɛ ρ}{r^{2}} f(Δt)$ as the criterion for deletion. The value of $ρ \times f(Δt)$ is low if the points are relatively old and few. The value of $\frac{ɛ}{r^{2}}$ is low if the points in the micro-cluster are sparse. We select the c-micro-clusters with the lowest δ for deletion.
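The deletion criterion can be sketched in Python as follows. The function names and the dictionary keys are hypothetical. The small floor on r is an assumption added to avoid division by zero for a micro-cluster that holds a single point; the text does not say how that case is handled.

```python
def delta(eps, rho, r, dt, lam, r_floor=1e-6):
    """Deletion score: eps * rho / r^2 * f(dt), with f(dt) = 2^(-lam * dt)."""
    r = max(r, r_floor)  # assumption: guard against zero radius
    return eps * rho / (r * r) * 2.0 ** (-lam * dt)

def prune(clusters, now, lam, capacity):
    """Keep the `capacity` c-micro-clusters with the highest score."""
    ranked = sorted(
        clusters,
        key=lambda c: delta(c["eps"], c["rho"], c["r"], now - c["t_c"], lam),
        reverse=True,
    )
    return ranked[:capacity]
```

A fresh, compact micro-cluster scores highest, while an old or sparse one sinks to the bottom of the ranking and is dropped first.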

Density change detection

Some natural clusters may be missed in a variable density data stream if ɛ is static. To improve clustering precision, a different ɛ is determined by SNN calculation for each density of the data stream. The density may change many times, yielding various ɛ values, and we build a separate CF tree for each ɛ. If the ɛ values of two CF trees differ by a small enough amount, they can take the same value. The process is shown in Algorithm 4.

Initially, ɛ is determined by SNN calculation. When the density changes in either of the following ways, ɛ is recalculated by SNN. The procedure is shown in Algorithm 5.

If the data stream changes from high density to low density, most of the new points in a period [T-W, T] are absorbed by new c-micro-clusters, and few are absorbed by existing clusters. We adopt the method proposed by Watanabe¹³ to detect the change. Suppose that in the period [T-W, T], num points arrive and sum of them are absorbed by existing clusters. Then too few points are absorbed by existing clusters if $\frac{sum}{num} < (1 \pm γ) η$, where $0 < γ, η < 1$. In that case, ɛ no longer fits the density of the current data stream and needs to be recalculated.

If the data stream changes from low density to high density, the radius r of a new a-micro-cluster is abnormally low. We consider that the density has changed if r < $ɛ β$ , where $0 < β < 1$ .
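The two detection conditions can be sketched in Python as follows. The function name and arguments are hypothetical. Because the threshold is written with a ± sign in the text, the sketch has to pick one reading; it uses the lower bound (1 − γ)η, which is an assumption and not necessarily the authors' choice.

```python
def density_changed(num, absorbed, new_radii, eps, gamma, eta, beta):
    """Return True if eps should be recalculated for the window [T-W, T].

    num       -- points that arrived in the window
    absorbed  -- points absorbed by existing clusters in the window
    new_radii -- radii of a-micro-clusters created in the window
    """
    # High to low density: too few points fall into existing clusters.
    # Assumption: the lower bound (1 - gamma) * eta is used as the threshold.
    to_low = num > 0 and absorbed / num < (1 - gamma) * eta
    # Low to high density: a new a-micro-cluster is abnormally tight.
    to_high = any(r < eps * beta for r in new_radii)
    return to_low or to_high
```

With γ = 0.1, η = 0.25 and β = 0.5, a window in which only 200 of 1000 points are absorbed triggers a recalculation, since 0.2 is below 0.225.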

Algorithm 4. Recalculate(ɛ,τ)

for each CF tree do

if $ɛ_c (1 - τ) \leq ɛ \leq ɛ_c (1 + τ)$ then

$ɛ = ɛ_c$ ;

Insert the new cluster into the CF tree;

end if

end for

Algorithm 5. Dencheck $(γ, η, β)$

Scan the points in [T-W,T];

if $\frac{sum}{num} < (1 \pm γ) η$ then

SNN(dc,k,MinPts);

Recalculate(ɛ,τ);

else

if r < $ɛ β$ then

SNN(dc,k,MinPts);

Recalculate(ɛ,τ);

end if

end if

The whole process of clustering variable density data stream is shown in Algorithm 6.

Algorithm 6. VDStream(DS,λ)

Scan the points of DS in [0,T];

SNN(dc,k,MinPts);

do{

for each new point p arriving in [T,T + W] do

Disposepoint (p);

end for

for each a-micro-cluster do

if $t_{check} \leq t$ then

move a-micro-cluster from CFA to CFC;

end if

end for

if t = T + W then

Dencheck( $γ, η, β$ );

Merge();

do{

Delete the c-micro-cluster with lowest δ;

}while(the memory limit is exceeded);

end if

if a clustering request comes then

generate clusters;

end if

}while(true)

Generating clusters

The summary information is maintained online. Clusters can be generated offline from the a-micro-clusters whenever users request them.

Definition 5 (directly density-reachable). A-micro-cluster C_p is directly density-reachable from a-micro-cluster C_q if the distance between the centers of C_p and C_q is lower than or equal to $ɛ_p + ɛ_q$, where $ɛ_p$ and $ɛ_q$ are determined by SNN.

Definition 6 (density-reachable). A-micro-cluster C_p is density-reachable from a-micro-cluster C_q if there is a chain of a-micro-clusters $C_1, C_2, \dots, C_n$, where $C_1 = C_p$ and $C_n = C_q$, such that each $C_{i+1}$ is directly density-reachable from $C_i$.

Definition 7 (density-connected). A-micro-cluster C_p is density-connected to a-micro-cluster C_q if both C_p and C_q are density-reachable from an a-micro-cluster C_o.

The offline clustering part groups a-micro-clusters according to density-reachability. Micro-clusters that do not belong to any cluster are considered to be noise.
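The offline grouping can be sketched in Python as a search for connected groups of a-micro-clusters under the direct density-reachability test. The function name `macro_clusters` is hypothetical, and the quadratic pairwise scan is chosen only for brevity. The sketch does not decide which groups count as noise; it only assigns a group label to every a-micro-cluster.

```python
import math

def macro_clusters(centers, eps_values):
    """Label a-micro-clusters so that density-reachable ones share a label."""
    n = len(centers)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                # Directly density-reachable: center distance <= eps_i + eps_j.
                if labels[j] == -1 and (
                    math.dist(centers[i], centers[j])
                    <= eps_values[i] + eps_values[j]
                ):
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

Because each micro-cluster carries its own ɛ, micro-clusters from CF trees of different density can still be linked when their neighborhoods touch.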

Experiments

We compare the effectiveness and efficiency of VDStream, DenStream and CluStream through experiments. The algorithms are implemented in Visual C++ and run on a Pentium D 2.8 GHz PC running Windows 7.

This paper uses the KDD-CUP’99 network intrusion detection dataset for the experiments. The data stream is generated in the input order of the dataset records, at a rate of 1000 data points per unit time. The parameters of the VDStream algorithm are set as follows: MinPts = 10, k = 4, λ = 0.2, τ = 0.25, β = 0.5, γ = 0.1 and η = 0.25.

The sum of squared distances (SSQ) is used as the evaluation criterion for the clustering results.
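As a minimal Python sketch of this criterion, the SSQ of a clustering can be computed as follows. The function name is hypothetical, and the sketch assumes that each point is charged to its nearest cluster center, which the text does not spell out.

```python
import math

def ssq(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(
        min(math.dist(p, c) ** 2 for c in centers)
        for p in points
    )
```

A lower SSQ indicates that points lie closer to their cluster centers.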

The SSQ comparison among VDStream, DenStream and CluStream is shown in Figure 3. The clustering quality of CluStream is lower than that of VDStream and DenStream because it is influenced by noise points.

Figure 3.

SSQ comparison.

The processing time of VDStream, DenStream and CluStream is shown in Figure 4. The processing time of CluStream is longer than that of VDStream and DenStream because it stores a snapshot at each time scale.

Figure 4.

Processing time vs. length of data stream.

Because the density varies little in the KDD-CUP’99 dataset, the SNN similarity is calculated only during initialization, so the processing time of VDStream is no longer than that of DenStream. VDStream thus has an advantage in both processing time and clustering quality over data streams of relatively stable density.

To compare the clustering results of VDStream and DenStream over a variable density data stream, the synthetic dataset shown in Figure 5 is used. The clustering results generated by VDStream are shown in Figure 6, and those generated by DenStream are shown in Figure 7. DenStream misses a meaningful cluster because it depends on a static parameter ɛ.

Figure 5.

Original dataset.

Figure 6.

Clustering on D1 by VDStream.

Figure 7.

Clustering on D1 by DenStream.

Conclusion

In this paper, we propose VDStream, a clustering algorithm over variable density data streams. VDStream is less sensitive to noise than CluStream, and it can discover clusters of arbitrary shape. In contrast with DenStream, VDStream can dynamically adjust the parameter ɛ by using the SNN similarity calculation. Thus, it finds clusters more accurately over variable density or low-density data streams.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Shandong Province Natural Science Fund under Grant (ZR2013DM011), Shandong Economic and Social Information of the Soft Science Research (2015E1017), Shandong Economic and Social Information of the Soft Science Research (2015E1018), Tai’an Science and Technology Development Plan (201430774) and ShanDong University of Science and Technology Research Team under Grant (2013KYTD04).

References

1. Aggarwal, Han, Wang. A framework for clustering evolving data streams. VLDB 2003; 2003: 81–92.

2. Feng C, Ester M, Weining Q, et al. Density-based clustering over an evolving data stream with noise. In: Proceedings of the sixth SIAM international conference on data mining (SIAM '06), Bethesda, pp.328–339.

3. Dong-Bo, Gang, Sheng-Li. Effective clustering algorithm for probabilistic data stream. J Softw 2009; 20: 1313–1328.

4. Jian-long, Feng, Ao-ying. Clustering evolving data streams over sliding windows. J Softw 2007; 18: 905–918.

5. Chen, Che-Qing, Ao-Ying. Clustering algorithm over uncertain data streams. J Softw 2010; 21: 2173–2182.

6. Guha, Mishra, Motwani. Clustering data stream. FOCS 2000; 2000: 359–366.

7. O’Callaghan, Mishra, Meyerson. Streaming-data algorithms for high-quality clustering. ICDE Conf 2002; 2002: 685–704.

8. Ester M, Kriegel H-P, Sander J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of 2nd ACM SIGKDD international conference on knowledge discovery and data mining, 1996, Portland, pp.226–231.

9. Ning, ChangJie. Clustering algorithm on data stream with distribution based on temporal density. J Softw 2010; 21: 1031–1041.

10. Zhuo, Yue. An adaptive grid-density based data stream clustering algorithm based on uncertainty model. J Comp Res Dev 2014; 51: 2518–2527.

11. Barbará. Requirements for clustering data streams. SIGKDD Explor 2003; 3: 23–27.

12. Honghui, Lizhen, Lihua. pgi-distance: an efficient method supporting parallel KNN-join process. J Comp Res Dev 2007; 44: 1774–1781.

13. Watanabe. Simple sampling techniques for discovery science. IEICE Trans Inform Syst 2000; E83-D: 19–26.