Sage Journals: Discover world-class research

Abstract

Subspace clustering, which detects all clusters in affine subspaces of a given high dimensional vector space, is used in various applications, including e-business. The performance and result of a subspace clustering algorithm highly depend on the parameter values the algorithm is tuned to execute. It may not be clear if the resultant clusters are indeed meaningful ones in a given dataset or if the result is just an artifact of the given parameter values. Although choosing the proper parameter values is crucial for both clustering quality and performance of the algorithm, there has been little research or discussion on this topic. In this paper, we propose a methodology for determining proper values of parameters in subspace clustering. Along with it, we validate our approach through experimental analysis, using various real-world datasets. The study can serve as a reference model for any subspace clustering experiment in which parameter setting is required to output clusters of quality.

1. Introduction

Recently, a group of algorithms called “subspace clustering” [1–4] are attracting academic interest for clustering high dimensional data. Clustering is a crucial task that is used in various applications, with the aim of detecting the dense regions of a given dataset, or as a prerequisite step for further processes, such as classification.

Subspace clustering can be widely used in many smart business application areas, which may include, but are not limited to the following [5, 6]. (i)

Product recommendations: the collaborative filtering technique is well known, and popularly used in the domain of product recommendation [7]. If the information conveying which customers have purchased what products is represented in a vector data model, to find out which customers have shown a similar purchase history becomes a subspace clustering problem [8].

(ii)

Smart sensor logs: as electronic devices and storage media become cheaper and small devices such as smart phones become popular, the log information that is collected by using smart sensors is attracting more interest from industry. The log may represent the users' patterns and can be used in product searching or recommendation [9]. The number of sensors and their collecting data can be high and numerous. The log information can also be represented in a vector model.

(iii)

Social network services: many social media sites such as Twitter provide users with a “follow” feature, which enables users to consume their own personalized contents. The user subscription information can be modelled in a high dimensional vector model [9]. Clustering users means to find out a group of users who have similar interests.

However, technological improvements in sensor, transmission, or storage domain have led to a flooding of high dimensional data, of which the dimensionality is typically equal to or greater than 10. With traditional clustering algorithms, which regard all dimensions equally, satisfactory results are hardly obtained, as the distance difference between pairs of data objects collapses. Unlike traditional approaches, subspace clustering has been proposed as an alternative to detect all clusters residing in affine subspaces of a given high dimensional vector space.

Adopting a density-based clustering paradigm [10], a subspace cluster is defined as a connected component of objects where two data objects are considered as “connected,” if and only if the distance between the projections of two objects onto a given affine subspace is not greater than a given bandwidth. One advantage of a density-based clustering paradigm is that it can detect clusters with arbitrary shapes, so too for subspace clustering. For this reason, considerable researches have been published on this topic.

However, most of these algorithms share a critical problem in common, that of parameters. To conduct clustering, a number of parameters should be employed, some of which are as follows: bandwidth (ε), density threshold (τ), minimum cluster size ( $m i n S i z e$ ), and duplication factor. All objects in each subspace cluster must be connected with no less than τ other objects on an associated subspace, and each cluster must have at least $m i n S i z e$ objects. Amongst this clusters, only nonsimilar clusters with respect to a given duplication factor are included in the final results [5].

For the last two parameters, 1% of the whole dataset size and 0.1 have been widely used in multiple works [2, 5]. However, there has been little literature or work on selecting the first two parameters, which heavily affect final clustering results. For example, if the value of ε is too large, the result may include too much noise. In contrast, if the value of ε is too small, we may get lossy results. The opposite situation can occur with regard to the value of τ. Moreover, not only does the selection of parameters impact the quality of clustering results, but also it heavily impacts the efficiency of the algorithm. The running time of the algorithm falls off as the value of ε decreases, and the value of τ increases, as the number of objects and connections that should be considered decreases accordingly. For these reasons, making a choice of adequate parameter values of ( $ε, τ$ ) pair is crucial. If their values are inappropriate, applying a subspace clustering algorithm to a given input will result in poor output or excessive running time, or possibly both.

However, selecting proper parameter values is not a simple task, for preliminary information is not available in common. One possible method may be a trial-and-error approach, which repeatedly conducts clustering tasks with different combinations of parameter values and then finally selects the most satisfactory result. Nevertheless, this approach also has its own limit: as clustering is inherently a computation-intensive task, its running time is typically long, so trying lots of combinations of parameters may not be practical.

In this paper, we propose a parameter-search method based on random sampling. We perform the experiments to present the impact of parameters in subspace clustering and to find out their proper values in the domain. Experimental analysis shows that our approach is reasonable in various real-world datasets.

2. Strategy

To overcome the problems stated above, we propose a simple yet efficient approach. Our search strategy exploits a random sampling approach. It includes the following steps. (1)

Determine the value of the ( $ε, τ$ ) pair that is computationally feasible for a given computation machine, using the full input set. That is, select the largest value of ε and smallest value of τ as far as is possible, so that it can be calculated in a desired timeframe by the given machine. We use $ε_{m a x}$ and $τ_{m i n}$ to denote each value, respectively. Of course, these values may be different with respect to the input.

(2)

Select candidate value pairs. For arbitrary $ε_{m i n}$ and $τ_{m a x}$ , we can choose $s \times t$ candidate pairs $(ε_{1}, τ_{1}), (ε_{1}, τ_{2}) \dots (ε_{m a x}, τ_{m a x})$ with two arithmetic sequences, $ε_{m i n} = ε_{1} < ε_{2} < \dots < ε_{s} = ε_{m a x}$ and $τ_{m i n} = τ_{1} < τ_{2} < \dots < τ_{s} = τ_{m a x}$ .

(3)

Generate a random sample from the full dataset, without replacement.

(4)

Run a given subspace clustering algorithm on the sample set, for all $s \times t$ candidate parameter pairs.

(5)

Compare the results, using quality measurement. Choose the optimal parameter pair $(ε_{o p t}, τ_{o p t})$ , in terms of the best quality.

For the reason indicated above, the running time with $(ε_{o p t}, τ_{o p t})$ on the original full dataset cannot be longer than that of $(ε_{m a x}, τ_{m i n})$ . Therefore, the running time is guaranteed to be less than the given timeframe. One advantage of this approach is twofold: it not only allows trials on various candidates, but also deals with the time limit, which is very common in real-world application.

3. Experimental Setup

To validate our approach, we perform an experiment to check the efficiency of the strategy or whether this method may actually detect adequate parameter values. For the experiment, we use three real-world datasets with different dimensionality and characteristics from the UCI machine learning repository [11]: the Pendigit dataset with 16 dimensionalities and the Cell, Biodeg dataset with 30, 41 dimensionalities, respectively.

From each of these datasets, we generate 3 input sets with different size: the input set with full population is generated by repeatedly selecting 10,000 objects from the original dataset. To normalize different dimensions, all element values are converted into the z-score of the associated dimension. Then, two smaller input sets with 1,000 and 5,000 objects are generated, through random sampling without replacement.

To sum up, we prepare 9 input sets from 3 different original datasets and with 3 different sizes (Table 1).

Table 1

Input datasets.

Input Set	Source	Size ( $\| DB \|$ )
Pendigit-1000	Pendigit	1,000
Pendigit-5000		5,000
Pendigit-10000		10,000

Cell-1000	Cell	1,000
Cell-5000		5,000
Cell-10000		10,000

Biodeg-1000	Biodeg	1,000
Biodeg-5000		5,000
Biodeg-10000		10,000

For algorithm implementation, we use a distributed version of the subspace clustering algorithm introduced in [6] which yields equivalent results to the algorithm introduced in [2], and runs on Hadoop cluster. Our Hadoop cluster consists of 16 commodity machines. For our algorithm to exploit not only MapReduce but also BSP, it uses Apache Giraph [12] for its implementation.

In addition, we operate an additional ZooKeeper cluster, which serves as a distributed memory. The cluster consists of 3 commodity machines. Tables 2 and 3 summarize the hardware specification of each cluster. All nodes within each cluster run on a virtual hardware system provided by DigitalOcean (https://www.digitalocean.com/), with Ubuntu 13.10 × 64 and Oracle Java Runtime Environment version 7, update 40.

Table 2

Specification of Hadoop cluster.

CPU
Core	8
Frequency	2.4 Ghz
L2	2 MB
L3	30 MB
RAM	32 GB
Storage	160 GB SSD

Table 3

Specification of ZooKeeper Cluster.

CPU
Core	4
Frequency	2.4 Ghz
L2	1 MB
L3	15 MB
RAM	16 GB
Storage	80 GB SSD

Using the settings in Tables 2 and 3, we compare the clustering results yielded from each input set with different ( $ε, τ$ ) values. To fix the value of $ε_{m a x}$ and $τ_{m i n}$ for each dataset, we use random values for $(ε, τ)$ and select a pair with which the execution of algorithm finishes in about 30 minutes.

Table 4 shows our parameter settings. As each dataset has its classification label, the quality of clustering can be measured. For accuracy measurement [13–15], we use $F 1$ score, as it considers both precision and recall values and is widely used in recent literatures [4, 14, 15]. In Table 4, all candidate values for $(ε, τ)$ satisfy $ε_{m i n} \leq ε \leq ε_{m a x}$ , $τ_{m i n} \leq τ \leq τ_{m a x}$ . As the size of each input set differs, τ represents the percentage to input set size. For example, if the size of the input set is 1000, $τ = 20 %$ means that each object in the clustered results is connected with more than 200 other objects.

Table 4

Parameter setup.

Source	$ε_{\min}$	$ε_{\max}$	$τ_{\min}$	$τ_{\max}$
Pendigit	8.0	16.0	15%	30%
Cell	10.0	20.0	32%	44%
Biodeg	8.0	16.0	55%	70%

4. Results Analysis

Tables 5, 6, 7, 8, 9, 10, 11, 12, and 13 show the $F 1$ score values of clustering results from 9 input sets with varying ( $ε, τ$ ) parameter settings, for an average of 5 independent trials. For each of the datasets, the three settings with the highest $F 1$ score are in bold. The result shows that the parameter settings that yield the most satisfactory clustering results are almost the same, regardless of the size of the input set.

Table 5

$F 1$ measurement of Pendigit-1000 input set (x axis: $ε, y$ axis: τ).

	10.0	12.0	14.0	16.0
15%	0.3300	0.2989	0.2862	0.2713
20%	0.3852	0.3213	0.3027	0.2945
25%	0.1898	0.2784	0.3596	0.3065
30%	0.0000	0.1655	0.3829	0.3328

Table 6

$F 1$ measurement of Pendigit-5000 input set (x axis: $ε, y$ axis: τ).

	10.0	12.0	14.0	16.0
15%	0.3305	0.3243	0.3002	0.2799
20%	0.2700	0.3505	0.3359	0.2977
25%	0.0000	0.2391	0.3181	0.3301
30%	0.0000	0.0000	0.3197	0.2734

Table 7

$F 1$ measurement of Pendigit-10000 input set (x axis: $ε, y$ axis: τ).

	10.0	12.0	14.0	16.0
15%	0.3231	0.3311	0.3213	0.2755
20%	0.0000	0.3754	0.3420	0.2984
25%	0.0000	0.2851	0.3712	0.3016
30%	0.0000	0.1526	0.3689	0.2705

Table 8

$F 1$ measurement of Cell-1000 input set (x axis: $ε, y$ axis: τ).

	16.0	20.0
32%	0.3630	0.1924
36%	0.0000	0.1746
40%	0.0000	0.3212
44%	0.0000	0.2769

Table 9

$F 1$ measurement of Cell-5000 input set (x axis: ε, y axis: τ).

	14.0	16.0	20.0
32%	0.1678	0.2920	0.1711
36%	0.0000	0.4572	0.1975
40%	0.0000	0.0000	0.3250
44%	0.0000	0.0000	0.4355

Table 10

$F 1$ measurement of Cell-10000 input set (x axis: $ε, y$ axis: τ).

	14.0	16.0	20.0
32%	0.3015	0.2805	0.2948
36%	0.3326	0.3513	0.3267
40%	0.2987	0.3622	0.3741
44%	0.0000	0.0000	0.3674

Table 11

$F 1$ measurement of Biodeg-1000 input set (x axis: $ε, y$ axis: τ).

	10.0	12.0	14.0	16.0
55%	0.5938	0.5745	0.5938	0.5911
60%	0.5774	0.5774	0.5836	0.5939
65%	0.0000	0.5939	0.5836	0.5836
70%	0.0000	0.5682	0.5939	0.5743

Table 12

$F 1$ measurement of Biodeg-5000 input set (x axis: $ε, y$ axis: τ).

	10.0	12.0	14.0	16.0
55%	0.6056	0.5996	0.6056	0.6031
60%	0.6136	0.6048	0.6136	0.6058
65%	0.5991	0.6024	0.5991	0.5991
70%	0.0000	0.6030	0.6048	0.6048

Table 13

$F 1$ measurement of Biodeg-10000 input set (x axis: $ε, y$ axis: τ).

	10.0	12.0	14.0	16.0
55%	0.6284	0.5972	0.6053	0.6155
60%	0.6315	0.6262	0.5961	0.6293
65%	0.6017	0.6294	0.6293	0.6242
70%	0.0000	0.6004	0.6025	0.6186

For example, Table 5 shows the $F 1$ score for pendigit-1000 input set. The table shows that none of the trials with $ε < 10$ results in a cluster. Table 6 shows the case of pendigit-5000 input set. Similar to the case of pendigit-1000, the result shows that parameter candidates with $ε < 10$ detect no clusters. This phenomenon is common throughout other input sets.

In the case of Pendigit dataset, the Pearson correlation coefficient of $F 1$ values between the full population and 10% sample was 0.5945, and the one between the full population and 50% sample was 0.7966, which suggests a strong positive linear relationship between them. In the case of the Cell and Biodeg datasets, the values of the Pearson correlation coefficient were 0.4915 and 0.6400, and 0.6985 and 0.9960, respectively. Figures 1, 2, and 3 show the trend of $F 1$ value between the two sample sets and the full population.

Figure 1

Correlation between different sizes: visualization of Tables 5–7 with x axis for parameter candidates (sort by ε first) and y axis for $F 1$ value.

Figure 2

Correlation between different sizes: visualization of Tables 8–10 with x axis for parameter candidates (sort by ε first) and y axis for $F 1$ value.

Figure 3

Correlation between different sizes: visualization of Tables 11–13 with x axis for parameter candidates (sort by ε first) and y axis for $F 1$ value.

The results suggest that the value of the most adequate parameter setting is not affected by the size of the input set. That is, it is a reasonable strategy to estimate optimal parameter values with a small sample of the full dataset to achieve both time efficiency and accuracy at once, evading the time limit. However, keeping the sample rate too low is not a good choice: as shown in Figure 3, the correlation coefficient between the 10% sample and the full population was only 70.1%~76.7% of the coefficient between the 50% sample and the full population. This trend suggests that trying with too small sample set may distort the result of estimation. Moreover, trying with too small $ε_{m i n}$ is also not recommended. Although our data consists of z-score values that are distributed on the range [0.0, 100.0), more than half of the candidates with $ε \leq 10.0$ yield no results.

5. Conclusions

Based on the experimental evaluation and result analysis, we propose the following methodology for estimating the optimal parameter values for subspace clustering. First, determine the value of $(ε_{m a x}, τ_{m i n})$ with which algorithm execution on the full input set finishes in the desired timeframe. Secondly, prepare candidate values for ( $ε, τ$ ) that satisfy $ε \leq ε_{m a x}$ and $τ_{m i n} \leq τ$ . Then, select one from these candidates as $(ε_{o p t}, τ_{o p t})$ that yields the best quality, when applied with the reduced input set of the original one, for example, the 10% random sample.

The advantage of this strategy may be twofold: it not only allows searching and comparison on various combinations of candidates, but also helps predictability of estimating the execution time. Experimental results with real-world datasets suggest that parameter values obtained from this approach can show the best accuracy on the full population of the input set.

Footnotes

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the Sookmyung Women's University Research Grants (1-1403-0114).

References

Agrawal

Gehrke

Gunopulos

Raghavan

Automatic subspace clustering of high dimensional data for data mining applications

Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD ‘98)

1998

New York, NY, USA

94 105

Assent

Krieger

Müller

Seidl

INSCY: indexing subspace clusters with in-process-removal of redundancy

Proceedings of the 8th IEEE International Conference on Data Mining (ICDM ‘08)

December 2008

Washington, DC, USA

719 724

10.1109/ICDM.2008.46

2-s2.0-67049137962

Kailing

Kriegel

H. P.

Kroger

Density-connected subspace clustering for high dimensional data

Proceedings of the 4th SIAM International Conference on Data Mining

2004

246 257

Müller

Assent

Günnemann

Seidl

Scalable density-based subspace clustering

Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM ‘11)

October 2011

New York, NY, USA

1077 1086

10.1145/2063576.2063733

2-s2.0-83055161713

Assent

Krieger

Müller

Seidl

Dusc: dimensionality unbiased subspace clustering

Proceedings of the 7th IEEE International Conference on Data Mining (ICDM ‘07)

October 2007

Washington DC, USA

409 414

10.1109/ICDM.2007.49

2-s2.0-47249137675

Lee

A distributed algorithm for high dimensional clustering [M.S. thesis] 2014

Seoul, Republic of Korea

Seoul National University

Yeon

Shim

Lee

S.-G.

Outlier detection techniques for biased opinion discovery

The Journal of Society for e-Business Studies 2013 18 4 315 326

Yeon

Lee

Shim

Lee

S.-G.

Product review data and sentiment analytical processing modeling

The Journal of Society for e-Business Studies 2011 16 4 125 137

Kriegel

H.-P.

Kröger

Zimek

Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering

ACM Transactions on Knowledge Discovery from Data 2009 3 1, article 1

10.1145/1497577.1497578

2-s2.0-67149084291

10.

Ester

Kriegel

H.-P.

Sander

A density-based algorithm for discovering clusters in large spatial databases with noise

Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD ‘96)

1996

AAAI Press

226 231

11.

http://archive.ics.uci.edu/ml/

12.

http://giraph.apache.org/

13.

Günnemann

Färber

Müller

Assent

Seidl

External evaluation measures for subspace clustering

Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM ‘11)

October 2011

New York, NY, USA

1363 1372

10.1145/2063576.2063774

2-s2.0-83055191163

14.

Müller

Assent

Krieger

Günnemann

Seidl

DensEst: density estimation for data mining in high dimensional spaces

Proceedings of the SIAM International Conference on Data Mining (SDM '09)

May 2009

SIAM

175 186

2-s2.0-72849151744

10.1137/1.9781611972795.16

15.

Müller

Günnemann

Assent

Seidl

Evaluating clustering in subspace projections of high dimensional data

Proceedings of the VLDB Endowment 2009 2 1 1270 1281

10.14778/1687627.1687770

Impact Parameter Analysis of Subspace Clustering

Abstract

1. Introduction

2. Strategy

3. Experimental Setup

4. Results Analysis

5. Conclusions

Footnotes

Conflict of Interests

Acknowledgment

References