Abstract
This paper proposes VDStream, a new effective method, to discover arbitrary shape clusters over variable density data streams. The algorithm can reduce the influence of history data and effectively eliminate the interference of noise data. When the density of data streams changes, VDStream can dynamically adjust the parameters of density to find precise clusters. Experiments demonstrate the effectiveness and efficiency of VDStream.
Introduction
In recent years, with the development of network technology and information technology, data streams as a new data model have appeared in many application fields, such as web click streams, traffic monitoring and management, sensor networks and intrusion detection. Data stream is continuously produced according to the time sequence and changes rapidly with different update rate. An enormous number of streaming data are generated in sequence without boundary.
Due to the rapid increase of data stream application, many methods about clustering data stream have been proposed.1–10 Considering the characteristics of massive data and high-speed change of data stream, data stream clustering should meet the following requirements 11 : (1) compression expression, (2) judging the outliers quickly and (3) processing new data incrementally and rapidly.
Related work
In order to effectively discover the clusters in the data stream, lot of methods have been proposed.
LocalSearch algorithm has been proposed by Guha et al. 6 based on the divide-and-conquer strategy. STREAM algorithm is a method which uses SSQ (sum of squared distance) for the evaluation of clustering quality. 7 The two algorithms can cluster all history data effectively in the limited time and space. However, influence of the old history data in evolving data streams is ignored.
DBSCAN can discover clusters of arbitrary shape and the cluster size can be different. But it is sensitive to the parameters because it requires the user to define the parameters of radius ɛ and minimum points (MinPts). Moreover, the limitation of memory is not considered. 8
Aggarwal et al. 1 proposed the algorithm called CluStream to cluster evolving data stream based on historical and current data. The clustering process adopts the framework including two parts, which are online micro-clustering part and offline macro-clustering part. Summary information of data streams is calculated and stored in micro-clusters and the increment maintenance of micro-cluster is based on Pyramid time frame model. Macro-clusters are generated offline according to the need of users. CluStream can generate spherical clusters effectively, but arbitrary shape clusters ineffectively. Furthermore, CluStream predefines the number of clusters. A new arriving data point is absorbed by an existing cluster, if it belongs to the existing cluster. Otherwise, a new cluster is established for which center is the new data point. If the memory is limited and clusters are full, it needs to remove the least recently used cluster or merge two existing clusters. If the point in the new cluster is outliers, the precision of clustering will be reduced.
DenStream is proposed for clustering evolving data streams. 2 It generates clusters of arbitrary shape based on density and is insensitive to noise. Although it need not predefine cluster number, it requires the user to define ɛ and MinPts. Furthermore, the quality of clusters is not good over variable density or low-density data streams.
Cluster A and cluster B are the two clusters hidden in the noise shown in Figure 1. The density of the noise around cluster A is equal to the density of cluster B. If the threshold of density is high enough, cluster A can be found and the points around cluster A are regarded as noise. If the threshold of density is low enough, cluster B can be found. However, the cluster A and the points around cluster A are regarded as a cluster. Obviously, it cannot produce ideal results by using density-based clustering method with static parameters in variable density data streams.
Clusters in variable density dataset.
This paper uses SNN method (Shared Nearest Neighbor) to dynamically define the parameters of density.
12
As shown in Figure 2, point A and point B are in the 8-nearest-neighbor of each other, where four points are shared. Then, the similarity of point A and point B is 4. Likewise, the similarity of point B and point C is 5. Because the similarity between points is usually 0, a sparse graph is used to represent the similarity of points.
SNN similarity.
This paper proposes a novel method named VDStream. Two parts of clustering strategy are adopted similar to CluStream and DenStream, which are online micro-clustering part and offline macro-clustering part. We use density-based clustering method to generate micro-clusters online and adopt pruning strategy to reduce the need of memory. The SNN similarity is calculated to redefine the parameters of density if the density of data streams change.
Fundamental concepts
Each data point can be saved in the static data processing. However, due to the large amount of data and memory limitations, it is not possible to find the accurate clusters over data streams. We try to find approximate cluster.
With the boundless data flowing, users are more interested in the new data than the old data. With the passage of time, the influence of the data points in clusters becomes more and more weak. Suppose that the data point
The center of a-micro-cluster is
For the two non-overlapping c-micro-clusters C1 and C2, which clustering features are CFC1 and CFC2, respectively, clustering feature is CFC1 + CFC2 if C1 and C2 merge. If point p is absorbed by C1 at time t1, CFC1 =
VDStream
The part of online micro-clustering is limited by space and time. For a new arrival point p, it can be absorbed by an exiting micro-cluster c if p belongs to c. If p do not belongs to any micro-cluster, p cannot be abandoned because point p may be noise or a first point of a new a-micro-cluster. Clustering features are stored on the hard disk at time intervals. Meaningful clusters can be generated offline according to the need of users.
Maintenance of micro-clusters
Initially, for the data points arrived at time interval [0, T], SNN calculates and finds connected branches. The detailed procedure is described in Algorithm 1. For the points in a connected branch, a a-micro-cluster is generated if the number of points in the branch is more than or equal to MinPts. The value of ɛ is specified as the current average radius of a-micro-clusters. Other branches and each isolated point form c-micro-clusters, respectively. SNN graph is sparse. The space complexity is O(km), the time complexity is O(mlogm) and m is the number of points.
Find k-nearest-neighbor for all points in data stream dc;
Sim(p,q) = 0;
Sim(p,q) = the number of k-nearest-neighbor points shared by p and q;
Construct the similarity graph;
Generate a-micro-cluster;
Computer the radius r of a-micro-cluster;
Find the a-micro-cluster Ca nearest to p;
Insert p into Ca;
Update Ca;
find the c-micro-cluster Cc nearest to p;
build new c-micro-cluster and absorb p;
Insert p into Cc;
Update Cc;
move Cc from CFC to CFA;
After that, it is processed as follows when a new data point q arrives at time t. The procedure is shown in Algorithm 2.
First, find the a-micro-cluster Ca nearest to q. Suppose that clustering feature of Ca is CFA = If the new radius of Ca is higher than ɛ, we try to add q to the nearest c-micro-cluster Cc for which clustering feature is CFC = If the new radius of Cc is lower than or equal to ɛ, we insert q into Cc. The new weight of Cc is
The state of a-micro-clusters and c-micro-clusters may be changed as follows.
For the a-micro-clusters Ca, the weight
In order to remove the fading a-micro-clusters from CFA tree in time, the weight of all a-micro-clusters needs to checked at time intervals. We suppose that the number of points in a a-micro-clusters is n and the checking frequency per unit time is m. Then for the a-micro-clusters, the time complexity of check is O(mn). Fortunately, we find the real check time for a-micro-clusters to reduce the time complexity. We suppose that at time ta, the weight of a-micro-clusters Ca is
If Two close c-micro-clusters may grow into a a-micro-cluster by merging them. If the value ɛ between two c-micro-clusters is different, then they cannot be merged because the density of them varies.
Suppose that c-micro-cluster CFC1 and c-micro-cluster CFC2 are merged, the new cluster is CFC = CFC1 + CFC2. If the new weight ρ of CFC is higher than or equal to MinPts and the new radius is lower than or equal to ɛ, CFC meet the conditions of a-micro-cluster. Then, we insert the new micro-cluster CFC into the CFA tree and remove CFC1 and CFC2 from the CFC tree.
In order to reduce the processing time, we sort all the c-micro-clusters according to ρ. The higher ρ is, the higher processing priority is. When ρ of the current processing c-micro-cluster is lower than
sort c-micro-clusters according to ρ;
Find c-micro-clusters Cc1 with biggest ρ;
//the weight of Cc1 is ρ1
Find its nearest neighbor Cc2 which the weight
Try to merge Cc1 and Cc2;
insert the new micro-cluster into CFA;
remove Cc1 and Cc2 from CFC;
} Due to the limit of space, c-micro-clusters must be deleted periodically.
The old and the low-compact degree c-micro-clusters should be deleted. For a c-micro-cluster CFC =
Density change detection
Some nature cluster may be ignored in variable density data stream if ɛ is static. In order to improve the precision of cluster, different ɛ is determined using SNN calculation according to different density of data stream. The density may be changed many times and various ɛ values are obtained. We build various CF trees according to various ɛ. If the difference of ɛ between two CF trees is small enough, they can take the same value. The process is shown in Algorithm 4.
Initially, ɛ is determined using SNN calculation. When the density changes as follows, ɛ is recalculated using SNN. The procedure is shown in Algorithm 5.
If the data stream changes from high density to low density, most of the new points are absorbed by new c-micro-cluster in a period of [T-W, T]. However, few new points are absorbed by existing clusters. We adopt the method proposed by Watanabe
13
to check the change. Suppose that in a period of [T-W, T], num points arrive and sum points are absorbed by existing clusters. Then, there is no enough points absorbed by existing clusters if If the data stream changes from low density to high density, the radius r of new a-micro-cluster is abnormally low. We consider that the density changes if r <
Insert the new cluster into the CF tree
Scan the points in [T-W,T];
SNN(dc,k);
Recalculate(ɛ,τ)
SNN(dc,k);
Recalculate(ɛ,τ);
The whole process of clustering variable density data stream is shown in Algorithm 6.
Scan the points of DS in [0,T];
move a-micro-cluster from CFA to CFC;
Delete the c-micro-cluster with lowest δ;
generating clusters;
}
Generating clusters
The summary of information is maintained online. Clusters can be generated from a-micro-cluster offline according to the need of users.
The offline clustering part executes based on density-reachable. The micro-clusters are considered to be noise if they do not belong to any cluster.
Experiments
We compare the effectiveness and efficiency of VDStream, DenStream and CluStream through experiments. Experiments are performed in the Visual C++ on a Pentium D 2.8 GHz PC, for which the operating system is Windows 7.
This paper adopts KDD-CUP’99 network intrusion detection dataset for experiment. Data stream is generated according to the input sequence of the dataset records with the speed of 1000 data points arriving per unit time. The parameters in VDStream algorithm are set as follows: MinPts = 10, k = 4, λ = 0.2, τ = 0.25, β = 0.5, γ = 0.1 and η = 0.25.
The square distance SSQ is used as the evaluation criterion for the clustering results.
The SSQ comparison among VDStream, DenStream and CluStream is shown in Figure 3. The clustering quality of CluStream is lower than VDStream and DenStream because it is influenced by noise points.
SSQ comparison.
The processing time of VDStream, DenStream and CluStream is shown in Figure 4. The processing time of CluStream is longer than VDStream and DenStream because it stores a snapshot at each time scale.
Processing time vs. length of data stream.
Due to the small density difference in KDD-CUP’99 dataset, the processing time of VDStream is not longer than DenStream. The calculation of SNN similarity is implemented initially. Obviously, VDStream has the advantage of the processing time and the quality clustering over relatively stable density data stream.
In order to compare the clustering results of VDStream and DenStream over variable density data stream, synthetic datasets are used as shown in Figure 5. The clustering results generated by VDStream are shown in Figure 6. The clustering results generated by DenStream are shown in Figure 7. DenStream lost a meaningful cluster due to the dependence on the parameter ɛ.
Original dataset. Clustering on D1 by VDStream. Clustering on D1 by DenStream.


Conclusion
In this paper, we propose VDStream, a clustering algorithm over variable density data streams. VDStream is less insensitive to noise than CluStream, and it can discover clusters of arbitrary shape. In contrast with DenStream, VDStream can dynamically adjust the parameters of ɛ by using SNN similarity calculation. Thus, it is more accurate to find clusters over variable density or low-density data streams.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Shandong Province Natural Science Fund under Grant (ZR2013DM011), Shandong Economic and Social Information of the Soft Science Research (2015E1017), Shandong Economic and Social Information of the Soft Science Research (2015E1018), Tai’an Science and Technology Development Plan (201430774) and ShanDong University of Science and Technology Research Team under Grant (2013KYTD04).
