Clustering stability evaluation method based on SSIM

Abstract

In clustering validity analysis, three main kinds of measures, namely intra-class cohesion, inter-class separation, and manual judgment indices, can be used to evaluate clustering results. An effective clustering result indicates good clustering stability. However, these methods require the sample data or the clustering algorithm to be provided in advance. This paper proposes a clustering stability evaluation method based on the Elliptic Fourier Descriptor structural similarity index (EFD-SSIM), which can evaluate clustering stability when only the clustering result is available. Its mechanism is that each cluster is mapped into a 2D graphic, and the degree of intra-class cohesion is measured by the structural similarity (SSIM) of the graphics. The experimental results show that EFD-SSIM evaluates clustering stability well and is consistent with the existing effectiveness evaluation indices for clustering algorithms.

Keywords

Clustering stability, SSIM, edge detection, Fourier descriptor, similarity

Introduction

Clustering is the process of dividing a set of physical or abstract objects into multiple clusters such that the objects within a cluster are similar to each other and different from the objects in other clusters. Its aim is to discover the potential distribution patterns and intrinsic structure of a dataset and the knowledge hidden in it. Clustering algorithms such as k-means,¹ BIRCH,² EM,³ DBSCAN,⁴ and CLARANS⁵ are widely used in machine learning. As an unsupervised learning method, clustering does not require predefined classes or labeled training samples that indicate how the data should be related. However, it is difficult to find a perfect clustering algorithm, and deviation is unavoidable in practice. Therefore, we need to design a comprehensive evaluation index of clustering results to measure the effectiveness of clustering algorithms.

The process of evaluating clustering results is also called clustering stability evaluation. Existing clustering stability evaluation indices can be classified into external and internal evaluation indices. An external evaluation index evaluates the clustering results by comparing them with the real distribution of the dataset. External indices can be further divided into those based on the contingency table, sample pairs, and information entropy.^6–8 The Jaccard coefficient and the Rand index (RI) calculate the similarity between two clusterings. RI evaluates the degree of agreement between the two partitions. It is effective for random clustering but requires real labels. In addition, the Jaccard coefficient is used to measure the importance of data for different clusters.⁹ The adjusted Rand index (ARI) compares the clusters with the original classification of the dataset and normalizes the result. The smaller the difference, the better the clustering effectiveness. Therefore, ARI can evaluate whether an algorithm is suitable for a dataset. Normalized Mutual Information (NMI),¹⁰ a classic clustering evaluation index, measures the degree of similarity of two clusterings based on their mutual information. The larger the mutual information, the better the clustering effectiveness. Adjusting the mutual information for chance gives the Adjusted Mutual Information (AMI). The entropy index indicates the degree to which each cluster is composed of data from a single class.¹¹ The total entropy is equal to the sum of the entropy of every cluster. The smaller the entropy, the better the clustering effectiveness. Different external evaluation indices have different emphases. Amigó et al.¹² proposed four formal constraints (cluster homogeneity, cluster completeness, rag bag, and cluster size vs. quantity) to compare the existing external evaluation indices. As pointed out by He and Yu,¹³ ARI is one of the best external evaluation indices at present.
Clustering result deviation is a common phenomenon in data analysis. Although many clustering evaluation indices have been proposed to cope with imbalanced data and differing cluster densities,^6,14 external evaluation indices that account for clustering result deviation have not been fully considered.
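The pair-counting idea behind RI and the Jaccard coefficient can be illustrated with a short sketch. It uses the standard pair-counting definitions rather than anything specific to this paper, and the function names are hypothetical.

```python
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Classify every pair of samples by agreement between two labelings.

    Returns (ss, sd, ds, dd): ss = same cluster in both labelings,
    sd = same in the true labeling only, ds = same in the predicted
    labeling only, dd = different cluster in both.
    """
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            ss += 1
        elif same_true:
            sd += 1
        elif same_pred:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labelings agree."""
    ss, sd, ds, dd = pair_counts(labels_true, labels_pred)
    return (ss + dd) / (ss + sd + ds + dd)


def jaccard_coefficient(labels_true, labels_pred):
    """Like RI, but pairs separated in both labelings are ignored."""
    ss, sd, ds, _ = pair_counts(labels_true, labels_pred)
    denom = ss + sd + ds
    # Convention assumed here: two all-singleton labelings count as identical.
    return ss / denom if denom else 1.0
```

Both functions need the real labels, which is exactly the prerequisite that external indices impose.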

The internal evaluation indices, which are the common means of discovering the best number of clusters, do not use prior information about the original dataset; instead, they discover the internal structure and distribution of the dataset by evaluating the clustering results. Internal evaluation indices fall into two categories: those based on statistical information and those based on geometric structure. ICP¹⁵ is an index based on statistical information; it evaluates a clustering result by measuring whether the points that are nearest to each other fall into the same cluster. Among the indices based on the geometric structure of the dataset, the commonly used Davies-Bouldin (DB) index takes the largest quotient of the intra-class average distances of two clusters and the distance between the centers of the two clusters as the evaluation index. DB is not suitable for evaluating ring-shaped distributions because it is calculated with the Euclidean distance. The Xie-Beni (XB) index¹⁵ considers both the fuzzy membership and the dataset structure. The degrees of cohesion and separation are related to each other: when the degree of cohesion is larger, the degree of separation is smaller, and vice versa. The Silhouette coefficient (SC) was proposed to address this phenomenon. It quantifies both the similarity of intra-class data and the similarity of inter-class data and combines the two. It is suitable for datasets whose actual labels are unknown. The larger the value of SC, the more stable the clustering result.¹⁶ The Dunn Validity Index (DVI) uses the shortest distance between two clusters to measure inter-class separation and the maximum cluster diameter to measure intra-class cohesion, and it takes the ratio of separation to cohesion as the indicator; the larger the DVI, the better the clustering result.
DVI is more effective on discrete datasets but is not suitable for ring-distributed datasets.¹⁷ Because of the defects of some internal evaluation indices, it is difficult to judge the structure of clusters, which results in poor evaluation of clustering effectiveness. Hence, it is hard to obtain good clustering results and to find the best number of clusters.
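As an illustration of a geometric internal index, the DVI described above (shortest inter-cluster distance divided by the largest cluster diameter) can be sketched as follows. The function name and the input layout are assumptions made for this example.

```python
import math
from itertools import combinations, product


def dunn_index(clusters):
    """Dunn Validity Index for clusters given as lists of coordinate tuples.

    Assumes at least two clusters and at least one cluster holding
    two or more distinct points (so the largest diameter is non-zero).
    """
    # Separation: shortest distance between points of different clusters.
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p, q in product(a, b)
    )
    # Cohesion: largest distance between two points of the same cluster.
    max_diameter = max(
        math.dist(p, q)
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )
    return min_separation / max_diameter
```

Because both terms are Euclidean distances between raw points, the index needs the sample data themselves, not just the cluster assignment.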

Although these evaluation methods have been successfully applied in different fields, their basic idea is either based on statistical testing or seeks the best result of a clustering algorithm under certain hypotheses and parameters. The prerequisite is knowledge of the dataset or an assumed clustering algorithm. If neither the dataset nor the clustering algorithm is known in advance, how can the clustering effectiveness be evaluated?

It is important to solve the above problem. After analysing intra-class cohesion, we think that when a stable clustering result is mapped into a 2D graphic description, the boundary contour of the graphic will not change easily, because the center point of the cluster strongly attracts the other points in the cluster. That is to say, when we randomly remove or add similar points in a stable cluster, the boundary contour of the mapping graph changes only slightly or not at all. It can be concluded that the stability of clustering results is related to the change in the boundary contour of the mapping graph. Based on this characteristic, we propose a structural similarity index based on the Elliptic Fourier Descriptor (EFD) to evaluate the stability of a clustering algorithm. The method uses EFD to describe the boundary contour of the mapping graph and combines it with the structural similarity index (SSIM) to evaluate the stability of the clustering result. By using this image recognition method, this research aims to evaluate the stability of a clustering algorithm when the actual labels of the dataset or the clustering algorithm are unknown.

The rest of the paper is organized as follows: In the SSIM based on EFD section, EFD-SSIM for clustering stability evaluation is derived. In the Experimental results and discussion section, the experimental process and results on instance data are given, including how to map a cluster into a Voronoi diagram, how to describe the boundary contour of the mapping graph, and how to calculate EFD-SSIM. In addition, EFD-SSIM is compared with other clustering stability evaluation methods.

SSIM based on EFD

Fourier expansion

First, we define a continuous curve c(t) in order to explain a Fourier expansion (Figure 1), which can be expressed by

c (t) = \frac{a_{0}}{2} + \sum_{k = 1}^{\infty} (a_{k} \cos (kwt) + b_{k} \sin (kwt))

(1)

Figure 1.

A continuous curve.

Figure 2.

Time series graph of DE.

According to the Euler’s formula

\begin{array}{l} c (t) = \frac{a_{0}}{2} + \sum_{k = 1}^{\infty} [\frac{a_{k}}{2} (e^{jkwt} + e^{- jkwt}) - j \frac{b_{k}}{2} (e^{jkwt} - e^{- jkwt})] \\ = \frac{a_{0}}{2} + \sum_{k = 1}^{\infty} [\frac{a_{k} - j b_{k}}{2} e^{jkwt} + \frac{a_{k} + j b_{k}}{2} e^{- jkwt}] \\ = c_{0} + \sum_{k = 1}^{\infty} [c_{k} e^{jkwt} + c_{- k} e^{- jkwt}] \\ = \sum_{k = - \infty}^{\infty} c_{k}^{'} e^{jkwt} \end{array}

(2)

If we define $c_{k} = c_{k 1} - j c_{k 2}, c_{- k} = c_{k 1} + j c_{k 2}$

Then, equation (2) can be derived as

\begin{array}{l} c (t) = c_{0} + 2 \sum_{k = 1}^{\infty} [c_{k 1} \frac{e^{jkwt} + e^{- jkwt}}{2} + j c_{k 2} \frac{- e^{jkwt} + e^{- jkwt}}{2}] \\ = c_{0} + 2 \sum_{k = 1}^{\infty} [c_{k 1} \cos (kwt) + c_{k 2} \sin (kwt)] \\ = c_{0} + \sum_{k = 1}^{\infty} [a_{k} \cos (kwt) + b_{k} \sin (kwt)] \end{array}

(3)

where $a_{k}$ and $b_{k}$ are called the Fourier descriptors (FD).

Then, $c_{k}, c_{- k}$ can be derived from equations (2) and (3)

c_{k} = \frac{a_{k} - j b_{k}}{2}, c_{- k} = \frac{a_{k} + j b_{k}}{2}

(4)

The coefficients in equation (3) can be obtained by considering the orthogonal property. Thus, one way to compute values for the descriptors is

a_{k} = \frac{2}{T} \int_{0}^{T} c (t) \cos (kwt) d t, b_{k} = \frac{2}{T} \int_{0}^{T} c (t) \sin (kwt) d t

(5)

EFD

Shape is one of the most important visual features for describing a target. The existing shape representation methods are classified into two categories: shape representation based on regional features and shape representation based on contour features. The latter uses pixel information on the boundary of the target coverage area to describe the shape.^18,19

FD is a classic contour-based shape representation that was originally proposed in 1960. The main idea is to use a set of values representing the overall frequency content of the shape to describe the contour features, with invariance to operations such as rotation and translation. It remains an active topic in shape representation research. Many researchers have worked to improve FD-based shape representation algorithms in order to enhance their representation ability. Zhang and Lu proposed an enhanced universal FD to extract the key content of a graph, which overcomes the shortcoming that most descriptors are not suitable for generic shape representation.²⁰ Li et al.²¹ proposed a region-based affine invariant ring FD for affine invariant feature extraction, which can be used to extract the contour features of objects with multiple components. Kasaudhan and Son proposed an enhanced version of the grid distance FD to calculate image similarity and improve the image matching ratio. Belkhaoui et al. combined FD with the watershed algorithm in a method for automatic target recognition based on inverse synthetic aperture radar images.²² The principle and related work of FD are described in detail below.

Let $(x_{0}, y_{0})$ be the starting point of the target boundary (Figure 1). After moving at a certain speed in the counterclockwise direction, the target boundary can be described by the coordinates of the boundary points. The boundary curve is defined as

s (t) = x (t) + jy (t), \quad t = 0, 1, \dots, N - 1

(6)

where t is the arc length travelled along the boundary curve. To describe the contour of the image, the point must travel one full circuit along the boundary from the selected starting point. Therefore, s(t) is a periodic function with period T = 2π.

For obtaining EFDs of boundary, we need to obtain Fourier expansion of boundary as shown in equation (2).

s_{k} = \frac{1}{T} \int_{0}^{T} s (t) e^{- jkwt} d t

(7)

According to equation (6)

s_{k} = s_{xk} + j s_{yk}

(8)

Then

s_{xk} = \frac{1}{T} \int_{0}^{T} x (t) e^{- jkwt} d t, \quad s_{yk} = \frac{1}{T} \int_{0}^{T} y (t) e^{- jkwt} d t

(9)

According to equation (4)

\begin{array}{ll} s_{xk} = \frac{a_{xk} - j b_{xk}}{2}, & s_{yk} = \frac{a_{yk} - j b_{yk}}{2} \\ s_{x - k} = \frac{a_{xk} + j b_{xk}}{2}, & s_{y - k} = \frac{a_{yk} + j b_{yk}}{2} \end{array}

(10)

Then, according to equation (5)

\begin{array}{ll} a_{xk} = \frac{2}{T} \int_{0}^{T} x (t) \cos (kwt) d t, & b_{xk} = \frac{2}{T} \int_{0}^{T} x (t) \sin (kwt) d t \\ a_{yk} = \frac{2}{T} \int_{0}^{T} y (t) \cos (kwt) d t, & b_{yk} = \frac{2}{T} \int_{0}^{T} y (t) \sin (kwt) d t \end{array}

(11)

Because the curve s(t) is known only at discrete sample points, we use a direct Riemann sum to approximate the integrals and obtain the discrete approximation of equation (11)

\begin{array}{ll} a_{xk} = \frac{2}{m} \sum_{i = 1}^{m} x_{i} \cos (kwi τ), & b_{xk} = \frac{2}{m} \sum_{i = 1}^{m} x_{i} \sin (kwi τ) \\ a_{yk} = \frac{2}{m} \sum_{i = 1}^{m} y_{i} \cos (kwi τ), & b_{yk} = \frac{2}{m} \sum_{i = 1}^{m} y_{i} \sin (kwi τ) \end{array}

(12)

where $a_{xk}, a_{yk}, b_{xk}, b_{yk}$ define an ellipse; m is the number of sampling points along the boundary (m is, in general, half of the number of sampling points); $τ = T / m$ is the sampling period; and $x_{i}$ and $y_{i}$ are the values at the ith sample point.
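The discrete sums of equation (12) translate almost line by line into code. The following is a minimal sketch, again assuming w = 2π/T; the function name and argument layout are illustrative and not part of the paper.

```python
import math


def elliptic_coefficients(xs, ys, k, period=2 * math.pi):
    """Return (a_xk, b_xk, a_yk, b_yk) of equation (12).

    xs[i - 1] and ys[i - 1] hold the boundary sample x_i, y_i taken at
    t = i * tau, for i = 1..m.
    """
    m = len(xs)
    w = 2 * math.pi / period      # assumed definition of w
    tau = period / m              # sampling period
    a_x = b_x = a_y = b_y = 0.0
    for i in range(1, m + 1):
        angle = k * w * i * tau
        a_x += xs[i - 1] * math.cos(angle)
        b_x += xs[i - 1] * math.sin(angle)
        a_y += ys[i - 1] * math.cos(angle)
        b_y += ys[i - 1] * math.sin(angle)
    scale = 2.0 / m
    return scale * a_x, scale * b_x, scale * a_y, scale * b_y
```

For a boundary that is itself an ellipse, x = 3 cos t and y = 2 sin t, only the first harmonic is non-zero, with a_x1 = 3 and b_y1 = 2, which gives a quick sanity check of the implementation.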

According to equations (8) and (10), $s_{k}$ can be seen as the sum of the complex pairs

s_{k} = A_{k} - j B_{k}, \quad s_{- k} = A_{k} + j B_{k}

(13)

We assume that

A_{k} = \frac{a_{xk} + j a_{yk}}{2}, \quad B_{k} = \frac{b_{xk} + j b_{yk}}{2}

(14)

Equation (6) can be expressed by equation (2) as follows

s (t) = s_{0} + \sum_{k = 1}^{\infty} (A_{k} - j B_{k}) e^{jkwt} + \sum_{k = 1}^{\infty} (A_{k} + j B_{k}) e^{- jkwt}

(15)

We assume that s(t) is shifted by Δx and Δy along the x- and y-axes, respectively, rotated by an angle φ in the anti-clockwise direction, and scaled by a factor s (the scalar s that multiplies the rotation matrix in equation (16)). Then, we obtain a new curve $\hat{s} (t) = \hat{x} (t) + j \hat{y} (t)$ . The relationship between s(t) and $\hat{s} (t)$ is obtained from equation (1)

\begin{array}{l} (\begin{matrix} \hat{x} (t) \\ \hat{y} (t) \end{matrix}) = \frac{1}{2} (\begin{matrix} a_{x 0} + 2 Δ x \\ a_{y 0} + 2 Δ y \end{matrix}) + s (\begin{matrix} \cos (φ) & \sin (φ) \\ - \sin (φ) & \cos (φ) \end{matrix}) \\ \times \sum_{k = 1}^{\infty} (\begin{matrix} a_{xk} & b_{xk} \\ a_{yk} & b_{yk} \end{matrix}) (\begin{matrix} \cos (kwt) \\ \sin (kwt) \end{matrix}) \end{array}

(16)

The coefficients in equation (16) are computed by

\begin{array}{l} {\hat{a}}_{xk} = s (a_{xk} \cos (φ) + a_{yk} \sin (φ)) \\ {\hat{b}}_{xk} = s (b_{xk} \cos (φ) + b_{yk} \sin (φ)) \\ {\hat{a}}_{yk} = s (- a_{xk} \sin (φ) + a_{yk} \cos (φ)) \\ {\hat{b}}_{yk} = s (- b_{xk} \sin (φ) + b_{yk} \cos (φ)) \\ {\hat{a}}_{x 0} = a_{x 0} + 2 Δ x \\ {\hat{a}}_{y 0} = a_{y 0} + 2 Δ y \end{array}

(17)

The following conclusion can be proved

\frac{| {\hat{A}}_{k} |}{| {\hat{A}}_{1} |} + \frac{| {\hat{B}}_{k} |}{| {\hat{B}}_{1} |} = \frac{| A_{k} |}{| A_{1} |} + \frac{| B_{k} |}{| B_{1} |} = \frac{\sqrt{a_{xk}^{2} + a_{yk}^{2}}}{\sqrt{a_{x 1}^{2} + a_{y 1}^{2}}} + \frac{\sqrt{b_{xk}^{2} + b_{yk}^{2}}}{\sqrt{b_{x 1}^{2} + b_{y 1}^{2}}}

(18)

Thus, an EFD that is invariant to translation, rotation, and scale transformations can be expressed as

EFD = \frac{| A_{k} |}{| A_{1} |} + \frac{| B_{k} |}{| B_{1} |}

(19)
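The invariance stated in equation (18) can be verified numerically. The sketch below computes the EFD of equation (19) from a list of coefficient tuples, applies the rotation and scaling of equation (17), and computes the EFD again. The function names and the tuple layout are assumptions made for this example; translation only changes the k = 0 terms and is therefore not modelled.

```python
import math


def efd_invariants(coeffs):
    """EFD of equation (19) for k = 1..K.

    coeffs[k - 1] = (a_xk, b_xk, a_yk, b_yk). Requires |A_1| and |B_1|
    to be non-zero.
    """
    def norms(c):
        a_x, b_x, a_y, b_y = c
        # |A_k| = hypot(a_xk, a_yk) / 2; the factor 1/2 cancels in the ratios.
        return math.hypot(a_x, a_y), math.hypot(b_x, b_y)

    a1, b1 = norms(coeffs[0])
    return [na / a1 + nb / b1 for na, nb in map(norms, coeffs)]


def rotate_and_scale(coeffs, scale, phi):
    """Apply equation (17) to every harmonic with k >= 1."""
    cos_p, sin_p = math.cos(phi), math.sin(phi)
    return [
        (
            scale * (a_x * cos_p + a_y * sin_p),
            scale * (b_x * cos_p + b_y * sin_p),
            scale * (-a_x * sin_p + a_y * cos_p),
            scale * (-b_x * sin_p + b_y * cos_p),
        )
        for a_x, b_x, a_y, b_y in coeffs
    ]
```

Comparing `efd_invariants(coeffs)` with `efd_invariants(rotate_and_scale(coeffs, 2.5, 0.7))` gives the same values up to rounding. Note that the first entry is always 2, since both ratios equal 1 for k = 1.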

EFD-SSIM

The structural similarity index (SSIM) is an index of the similarity of two images,²³ which measures the similarity of their brightness, contrast, and structure. Its calculation requires only the mean, variance, and covariance and does not require a complex image feature extraction process, so it has been widely applied in image and video processing.^24–27 SSIM is expressed by

\begin{array}{l} SSIM (x, y) = L (x, y) \times C (x, y) \times S (x, y) \\ L (x, y) = \frac{2 μ_{x} μ_{y} + c_{1}}{μ_{x}^{2} + μ_{y}^{2} + c_{1}} \\ C (x, y) = \frac{2 σ_{x} σ_{y} + c_{2}}{σ_{x}^{2} + σ_{y}^{2} + c_{2}} \\ S (x, y) = \frac{σ_{xy} + c_{3}}{σ_{x} σ_{y} + c_{3}} \end{array}

(20)

\begin{array}{l} c_{1} = {(k_{1} \times l)}^{2} \\ c_{2} = {(k_{2} \times l)}^{2} \\ c_{3} = c_{2} / 2 \end{array}

(21)

where L(x, y) is the brightness similarity between image x and image y, C(x, y) is the contrast similarity between the images, and S(x, y) is the structural similarity between the images. $μ_{x}$ and $μ_{y}$ denote the means of images x and y, respectively; $σ_{x}$ and $σ_{y}$ denote the standard deviations of images x and y, respectively; and $σ_{xy}$ denotes the covariance of images x and y. Generally, $k_{1} = 0.01, k_{2} = 0.03, l = 255$ (l is the range of pixel values, which is 255 in general).
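Equations (20) and (21) can be sketched for two images given as flat lists of pixel values. This is a global, single-window version that uses population (divide by n) statistics; both choices are simplifying assumptions, and the function name is illustrative.

```python
def ssim(x, y, k1=0.01, k2=0.03, dynamic_range=255):
    """Global SSIM of two equally sized images given as flat pixel lists."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = var_x ** 0.5, var_y ** 0.5

    # Stabilising constants of equation (21).
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    c3 = c2 / 2

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)
    return luminance * contrast * structure
```

An image compared with itself scores 1. Two images with the same mean and variance but opposite structure (for example, a gradient and its mirror image) score close to -1, because only the structure term S changes sign.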

We believe that the stability evaluation of a clustering algorithm can be transformed into an image structural similarity calculation when the cluster is mapped into 2D graphics. This is because, in a stable cluster, the center point strongly attracts the other points in the cluster; that is, any point belonging to the cluster will be as close as possible to the center point and away from the boundary. This characteristic is reflected in the mapping graph of a stable clustering result: when any similar points are added or removed, the boundary contour of the mapping graph is unchanged or only slightly changed. Based on this characteristic, we take image structural similarity as the evaluation index of the clustering algorithm, with the boundary contour features considered as one of the elements of the structural similarity comparison. In this way, the stability of the clustering algorithm is measured. If the structural similarity of the two mapping images is higher, then the stability of the clustering algorithm is better. Otherwise, the effectiveness of the clustering algorithm is not good.

Following the above ideas, we propose a stability evaluation method for clustering algorithms (EFD-SSIM), which evaluates stability by combining SSIM with FD. EFD-SSIM is defined by equations (22) and (23). A value of EFD-SSIM closer to 1 indicates better clustering stability.

EFD - SSIM (x, y) = L (x, y) \times C (x, y) \times ES (x, y) \in [0, 1]

(22)

ES = \frac{| cov (EFD 1, EFD 2) |}{σ_{EFD 1} \times σ_{EFD 2}} \in [0, 1]

(23)

where $cov (\cdot)$ denotes the covariance of the EFDs and $σ_{EFD i}$ (i = 1, 2) denotes the standard deviation of EFD i.
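A minimal sketch of equations (22) and (23) follows. It assumes that L and C are computed from the pixel values of the two mapped images exactly as in equation (20), and that EFD 1 and EFD 2 are the EFD sequences of the two contours; the paper does not spell out these details, so the function names and argument layout are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def _std(values):
    m = _mean(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5


def es(efd1, efd2):
    """ES of equation (23): absolute correlation of two EFD sequences."""
    m1, m2 = _mean(efd1), _mean(efd2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(efd1, efd2)) / len(efd1)
    denom = _std(efd1) * _std(efd2)
    if denom == 0:
        raise ValueError("ES is undefined for a constant EFD sequence")
    return min(1.0, abs(cov) / denom)   # clamp rounding noise above 1


def efd_ssim(x, y, efd1, efd2, k1=0.01, k2=0.03, dynamic_range=255):
    """EFD-SSIM of equation (22) for flat pixel lists x, y and their EFDs."""
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    mu_x, mu_y = _mean(x), _mean(y)
    sd_x, sd_y = _std(x), _std(y)
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (sd_x ** 2 + sd_y ** 2 + c2)
    return luminance * contrast * es(efd1, efd2)
```

Because ES takes the absolute value of the correlation, the structure term cannot be negative, which is what keeps the index in [0, 1] for non-negative pixel values.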

Experimental results and discussion

In this section, the effectiveness of EFD-SSIM is demonstrated on four different datasets.

Dataset 1: A case of fault bearings data, which is provided by Case Western Reserve University.²⁸

Bearing fault data come from a motor performance database which can be used to validate and/or improve a host of motor condition assessment techniques, with a motor bearing condition assessment system developed at Rockwell. Motor performance database includes normal bearings data, single-point drive end and fan end defects data. All data files are in Matlab format. Each file contains fan and drive end vibration data as well as motor rotational speed.

Dataset 2: A case study of liquefied petroleum gas (LPG) pipeline leak detection.²⁹ The LPG pipeline is more than 100 km long, with mass flow meters at its inlet and outlet. Data are collected from those meters every 10 s. The pipeline is mostly operated in a leak-free (normal) condition. However, during a leak trial period (abnormal condition), a leak was created in the pipeline. The leak lasted for a few hours and was controlled through a valve. Figure 6 shows the inlet and outlet flow readings (normal and leak, respectively).

Dataset 3: HTRU2 is a dataset which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South).³⁰ The dataset shared here contains 16,259 spurious examples caused by RFI/noise, and 1639 real pulsar examples. These examples have all been checked by human annotators. Each candidate is described by eight continuous variables. The first four are simple statistics obtained from the integrated pulse profile. This is an array of continuous variables that describe a longitude-resolved version of the signal that has been averaged in both time and frequency. The remaining four variables are similarly obtained from the DM-SNR curve.

Dataset 4: MFCCs includes acoustic features extracted from syllables of anuran (frog) calls, together with the family, genus, and species labels.³¹ This dataset was used in several classification tasks related to the challenge of recognizing anuran species through their calls. It is a multi-label dataset with three columns of labels. The dataset was created by segmenting 60 audio records belonging to 4 different families, 8 genera, and 10 species. Each audio record corresponds to one specimen (an individual frog); the record ID is also included as an extra column. The spectral entropy and a binary cluster method were used to detect the audio frames belonging to each syllable. The segmentation and feature extraction were carried out in Matlab. After the segmentation, 7195 syllables were obtained, which became the instances for training and testing the classifier. These records were collected in situ under real noise conditions (the background sound).

Experiment 1

In this section, we will use dataset 1 to verify EFD-SSIM. The dataset includes the following variables:

DE – drive end accelerometer data

FE – fan end accelerometer data

BA – base accelerometer data

time – time series data

RPM – r/min during testing

We select three clustering algorithms (Clarans, Kmeans, Dbscan) to perform fault diagnosis based on feature clustering. Only DE (see Table 1) is used for testing EFD-SSIM. DE is time series data, and its graph is shown in Figure 2.

Table 1.

Partial data of motor bearings.

i	DE	FE	i	DE	FE	i	DE	FE
1	–0.0028	–0.2472	13	–0.2110	0.0119	26	0.2659	0.2786
2	–0.0963	0.1428	14	–0.0468	0.1824	27	0.0218	–0.0351
3	0.1137	0.0033	15	0.2021	–0.0754	28	–0.1639	0.1545
4	0.2573	–0.1068	16	–0.0145	–0.0536	29	0.1420	0.1999
5	–0.0583	0.1360	17	–0.1628	0.0805	30	0.2311	–0.0980
6	–0.1260	–0.0051	18	0.1092	–0.1389	31	–0.1075	0.0105
7	0.2074	–0.0625	19	0.1871	–0.0555	32	–0.1402	0.0988
8	0.1727	0.2735	20	–0.1593	0.1650	33	0.1915	–0.0555
9	–0.2199	0.1473	21	–0.1377	–0.1582	–	–	–
10	–0.1561	–0.0925	22	0.2505	0.0370	–	–	–
11	0.2240	0.1709	23	0.1075	0.3096	–	–	–
12	0.1137	0.0427	24	–0.2469	–0.0857	n	–	–

DE: drive end; FE: fan end.

In motor bearing fault diagnosis, the quality of fault feature extraction determines the final diagnostic rate. Peak average rectified (PAR), KURTOSIS, and SKEWNESS cover the distribution characteristics, the statistical characteristics, and the linear characteristics of vibration, and they effectively reflect the major characteristics of vibration events. Therefore, these characteristics are taken as the basis of fault diagnosis. They are calculated as follows

PAR = E \{ \max (y^{2}) / E (y^{2}) \}

(24)

where y denotes power of peak

K = \frac{\int_{- \infty}^{+ \infty} {[x (t) - \bar{x}]}^{4} p (x) d x}{σ^{4}}

(25)

S = \frac{E [{(X - \bar{x})}^{3}]}{σ^{3}}

(26)

where x(t) is the instantaneous amplitude, $\bar{x}$ is the mean of the amplitude, p(x) is the probability density, and σ is the standard deviation.
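The three features of equations (24) to (26) can be sketched for a single window of vibration samples. Equation (24) is read here as the peak power divided by the mean power of one window, with the outer expectation (an average over windows) left out; that reading, the use of sample moments in place of the integrals, and the function names are assumptions made for this example.

```python
def peak_to_average(y):
    """Peak power over mean power of one window (one reading of equation (24))."""
    power = [v * v for v in y]
    return max(power) / (sum(power) / len(power))


def central_moment(x, order):
    """Sample central moment, used in place of the integral in equation (25)."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** order for v in x) / len(x)


def kurtosis(x):
    """Fourth central moment over sigma^4 (equation (25), not the excess form)."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2


def skewness(x):
    """Third central moment over sigma^3 (equation (26))."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5
```

A square wave such as `[-1.0, 1.0, -1.0, 1.0]` is a convenient check: its peak-to-average ratio and kurtosis are both 1 and its skewness is 0.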

Because some of the eigenvectors of the vibration may be correlated, they will affect the stability of the clustering fault diagnosis model. To overcome this effect, correlated eigenvectors must first be removed. According to the result of principal component analysis (PCA), PAR and KURTOSIS are uncorrelated, so they are used to represent DE. The feature sets (F) are built by merging PAR with KURTOSIS (see Table 2).

Table 2.

Feature sets of DE.

i	PAR	KURT	i	PAR	KURT	i	PAR	KURT
1	3.2648	–0.3647	10	3.1979	–0.1333	19	3.4168	1.0757
2	3.0934	–0.3537	11	3.1780	–0.2197	20	3.7441	1.3899
3	3.0752	–0.4350	12	3.2202	–0.2486	21	3.7571	1.5904
4	3.0823	–0.4159	13	3.8729	1.5753	22	3.7994	1.5775
5	3.2896	0.0833	14	3.7868	1.7356	23	3.7263	1.3241
6	3.3391	0.2690	15	3.6880	1.5991	24	3.5290	0.6109
7	3.3191	0.1479	16	3.6230	1.4547	–	–	–
8	3.2075	0.2307	17	4.1758	1.8610	–	–	–
9	3.5993	–0.1001	18	3.5839	1.7459	n	–	–

PAR: peak average rectified.

Step 1: the best number of clusters is estimated in advance by existing clustering evaluation indices such as DB, DVI, and SC; the best number of clusters is 2. Test sets (T) are randomly extracted from the feature sets (F) in the proportion of 3:1. F and T are each classified into two classes as the prediction result (see Figure 3).

Figure 3.

Results of clustering of F sets and T sets.

To evaluate the stability of clustering results using image evaluation techniques, the clustering result must be mapped into a 2D graphic. We use the Delaunay triangulation algorithm to map a certain cluster (called class I) of the F sets and T sets; the mapped clusters of the two sets are in a containment relationship. The mapped graphic is called a Voronoi diagram; it is a structure of computational geometry that can be used for qualitative analysis, statistical analysis, and nearest-neighbor analysis.³² In this section, the Euclidean distance between every two points of class I is computed; each point is taken as the vertex of a triangle and joined with its two nearest points by Euclidean distance; a Delaunay triangular net is obtained after N iterations. The triangles sharing a common vertex are recorded; the centers of their circumcircles are computed and then connected in clockwise order. In this way, a Voronoi diagram can be drawn; the time complexity of the divide-and-conquer construction is $O (n \log n)$ . The process of the algorithm is as follows:

Hypotheses:

$(X, Y) = {(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})}$ . It represents the point set P consisting of N non-repeating points on the plane and the specific steps of constructing the Delaunay triangulation of point set P are:

Step 1: the N points are sorted, mainly by their X coordinates.

Step 2: structural process:

If N = 2, return

If N = 3, connect the three points to construct a triangular net and return

Divide the N points into subsets $P_{l}$ and $P_{r}$ on the basis of the even-split principle or the nearest-neighbor principle

Construct the triangular net $DT (P_{l})$ of $P_{l}$

Construct the triangular net $DT (P_{r})$ of $P_{r}$

Merge $DT (P_{l})$ with $DT (P_{r})$ and return the result

Step 3: merging process

For the given $DT (P_{l})$ and $DT (P_{r})$ , calculate the convex hulls of $P_{l}$ and $P_{r}$

Obtain the top tangent UCT and the bottom tangent BCT

Starting from the BCT, use the left endpoint, the right endpoint, and their adjacent points to merge $DT (P_{l})$ with $DT (P_{r})$ until the UCT is reached.
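To make the link between the Delaunay triangulation and the Voronoi diagram concrete, the sketch below uses a brute-force empty-circumcircle test instead of the divide-and-conquer procedure above. It is only suitable for small point sets in general position, and every name in it is illustrative. The circumcenters of the accepted triangles are the vertices of the Voronoi diagram.

```python
from itertools import combinations


def circumcenter(a, b, c):
    """Center of the circle through a, b, c, or None if they are collinear."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy


def dist2(p, q):
    """Squared Euclidean distance."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def delaunay_bruteforce(points, eps=1e-9):
    """Return [((i, j, k), circumcenter), ...] for all Delaunay triangles.

    A triangle is kept when no other point lies strictly inside its
    circumcircle. Every triple is tested against every point, so this
    is far slower than the divide-and-conquer method.
    """
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        center = circumcenter(points[i], points[j], points[k])
        if center is None:
            continue
        r2 = dist2(center, points[i])
        others = (m for m in range(len(points)) if m not in (i, j, k))
        if all(dist2(center, points[m]) >= r2 - eps for m in others):
            triangles.append(((i, j, k), center))
    return triangles
```

For the four corners of the unit square plus its center, the sketch returns the four triangles that meet at the center point, and their circumcenters are the midpoints of the square's sides.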

The Voronoi diagrams of class I of the F sets and T sets are shown in Figure 4. The Canny edge detection operator is used to recognize the boundary shape of each Voronoi diagram.

Figure 4.

Voronoi diagram of cluster.

The Canny algorithm uses four operators to detect horizontal, vertical, and diagonal edges in an image. Derivatives are computed in the horizontal direction ( $G_{x}$ ) and the vertical direction ( $G_{y}$ ). As a result, the gradient strength G and the orientation θ of each pixel can be determined by

\begin{array}{l} G = \sqrt{G_{x}^{2} + G_{y}^{2}} \\ θ = \arctan (G_{y} / G_{x}) \end{array}

(27)

G of the current pixel is compared with that of the two pixels in the positive and negative gradient directions. The pixel is regarded as an edge point if its G is greater than that of the other two pixels; otherwise, it is suppressed (non-maximum suppression). After non-maximum suppression, some edge pixels caused by noise and color changes still remain. A high threshold and a low threshold are set to resolve this defect. A pixel is regarded as a strong edge pixel if G is greater than the high threshold, and as a weak edge pixel if G is lower than the high threshold and greater than the low threshold. A weak edge pixel is preserved as an edge point as long as there is a strong edge pixel among its eight adjacent pixels. A pixel is suppressed if G is lower than the low threshold. The contours of the mapping graphs after the Canny algorithm is executed are shown in Figure 5.
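Two pieces of this procedure, the gradient of equation (27) and the double-threshold classification, can be sketched as follows. The Sobel kernels are an assumption (the text does not name the derivative operators), `atan2` replaces arctan(G_y / G_x) to avoid division by zero, and non-maximum suppression and the eight-neighbor linking of weak pixels are left out for brevity.

```python
import math

# Sobel kernels, assumed here as the horizontal and vertical derivative operators.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def gradient(image, r, c):
    """Gradient strength G and orientation theta at interior pixel (r, c)."""
    gx = gy = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            pixel = image[r + dr][c + dc]
            gx += SOBEL_X[dr + 1][dc + 1] * pixel
            gy += SOBEL_Y[dr + 1][dc + 1] * pixel
    return math.hypot(gx, gy), math.atan2(gy, gx)


def classify(g, low, high):
    """Double-threshold step: label a pixel by its gradient strength."""
    if g > high:
        return "strong"
    if g > low:
        return "weak"
    return "suppressed"
```

On an image with a single vertical step from 0 to 255, pixels next to the step get a large G with θ = 0 and are labeled strong, while pixels in the flat regions get G = 0 and are suppressed.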

Figure 5.

Contour of mapped graphic.

The EFDs of the contours of class I of the F set and the T set are computed by equations (7) to (19) (see Table 3).

Table 3.

EFD of contour of result of fault classes.

Test Set T	Clarans	0.0015	0.0031	0.0046	0.0076	0.0092	0.0107	0.0122	0.0137	–
	Kmeans	0.0040	0.0080	0.0120	0.0160	0.0199	0.0239	0.0279	0.0319	–
	Dbscan	0.0038	0.0077	0.0115	0.0192	0.0230	0.0268	0.0307	0.0345	–
Feature Set F	Clarans	0.0017	0.0033	0.0050	0.0067	0.0083	0.0100	0.0117	0.0133	–
	Kmeans	0.0039	0.0079	0.0118	0.0157	0.0197	0.0236	0.0275	0.0314	–
	Dbscan	0.0034	0.0068	0.0102	0.0135	0.0169	0.0203	0.0237	0.0271	–

The EFD-SSIM values of the three clustering algorithms (Clarans, Kmeans, Dbscan), computed by equations (22) and (23), are 0.816237, 0.993007, and 0.674093, respectively.

Experiment 2

In this experiment, dataset 2 is used. Figure 6 shows the inlet and outlet flow readings (normal and leak, respectively).

Figure 6.

Data of LPG.

In this section, we use the same method as in experiment 1 for 2D graphic mapping, edge detection, and EFD calculation, so these steps are not described again here. Table 4 shows the EFD of the mapped graphic of the LPG leak state diagnosis. The values in Table 4 are computed by equations (7) to (19). The EFD-SSIM values of the three clustering algorithms (Clarans, Kmeans, Dbscan), computed by equations (22) and (23), are 0.888282, 0.999994, and 0.180958, respectively.

Table 4.

EFD of mapped graphic of leak diagnosis.

Test Set T	Clarans	0.0007	0.0014	0.0064	0.0028	0.0036	0.0043	0.0050	0.0057	……
	Kmeans	0.0021	0.0043	0.0027	0.0086	0.0107	0.0129	0.0150	0.0172	……
	Dbscan	0.0008	0.0018	0.0115	0.0036	0.0044	0.0053	0.0062	0.0071	……
Feature set F	Clarans	0.0022	0.0043	0.0065	0.0086	0.0108	0.0130	0.0151	0.0173	……
	Kmeans	0.0022	0.0043	0.0064	0.0086	0.0107	0.0129	0.0150	0.0172	……
	Dbscan	0.0010	0.0019	0.0029	0.0039	0.0048	0.0058	0.0068	0.0078	……

Experiment 3

Dataset 3 HTRU2 is used in this experiment. The partial data of dataset HTRU2 are shown in Table 5.

Table 5.

The partial data of HTRU2.

i	MIP	SDIP	EKIP	SIP	MDSC	SDDSC	EKDSC	SDSC
1	140.5625	55.6837	–0.2345	–0.6996	3.1998	19.1104	7.9755	74.2422
2	102.5078	58.8824	0.4653	–0.5150	1.6772	14.8601	10.5764	127.3936
3	103.0156	39.3416	0.3233	1.0511	3.1212	21.7446	7.7358	63.1719
4	136.75	57.1784	–0.0684	–0.6362	3.6429	20.9592	6.8964	53.5936
5	88.7265	40.6722	0.6008	1.1234	1.1789	11.4687	14.2695	252.5673
6	93.57031	46.6981	0.5319	0.4167	1.6362	14.5450	10.6217	131.394
7	119.4844	48.7650	0.0314	–0.1121	0.9991	9.2796	19.2062	479.7566
8	130.3828	39.8440	–0.1583	0.3895	1.2207	14.3789	13.5394	198.2365
9	107.25	52.6270	0.45268	0.17034	2.3319	14.4868	9.0010	107.9725
10	107.2578	39.4964	0.4658	1.1628	4.0794	24.9804	7.3970	57.7847
11	142.0781	45.2880	–0.3203	0.2839	5.3762	29.0099	6.0762	37.8313
12	–	–	–	–	–	–	–	–

MIP: mean of the integrated profile; SDIP: standard deviation of the integrated profile; EKIP: excess kurtosis of the integrated profile; SIP: Skewness of the integrated profile; MDSC: mean of the DM-SNR curve; SDDSC: standard deviation of the DM-SNR curve; EKDSC: excess kurtosis of the DM-SNR curve; SDSC: Skewness of the DM-SNR curve.

In this section, we use the same method as in experiment 1 for 2D graphic mapping, edge detection, and EFD calculation, so these steps are not described again here. According to the real labels of HTRU2, we set the number of clusters to 2. Table 6 shows the EFD of the mapped graphic of the clusters of dataset HTRU2. The values in Table 6 are computed by equations (7) to (19). The EFD-SSIM values of the three clustering algorithms (Clarans, Kmeans, Dbscan), computed by equations (22) and (23), are 0.999989, 0.999993, and 0.994068, respectively.

Table 6.

EFD of mapped graphic of HTRU2.

Test sets T	Clarans	0.0015	0.0029	0.0043	0.0058	0.0073	0.0088	0.0102	0.0117	–
	Kmeans	0.0012	0.0031	0.0040	0.0058	0.0069	0.0087	0.0111	0.0117	–
	Dbscan	0.2823	0.2837	0.2879	0.2893	0.2907	0.2921	0.2935	0.2945	–
Feature sets F	Clarans	0.0014	0.0030	0.0044	0.0059	0.0074	0.0089	0.0104	0.0118	–
	Kmeans	0.0015	0.0029	0.0044	0.0058	0.0073	0.0088	0.0102	0.0117	–
	Dbscan	0.0131	0.1461	0.0161	0.0175	0.1900	0.0204	0.0219	0.0234	–
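
As a rough illustration of how a structural-similarity score between the test-set and feature-set rows of one algorithm in Table 6 could be computed, the sketch below applies the standard global SSIM formula to two descriptor vectors. It is not a reproduction of equations (22) and (23): the function name `ssim`, the stabilizing constants `c1` and `c2`, and the use of only the eight coefficients printed in Table 6 are assumptions, so the result is not expected to match the EFD-SSIM values reported above.

```python
# Illustrative sketch only: standard global SSIM between two equal-length vectors.
# c1 and c2 are assumed stabilizing constants, not values taken from the paper.

def ssim(x, y, c1=1e-8, c2=1e-8):
    """Return (2*mx*my + c1)(2*cov + c2) / ((mx^2 + my^2 + c1)(vx + vy + c2))."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Population variances and covariance of the two vectors.
    vx = sum((a - mx) * (a - mx) for a in x) / n
    vy = sum((b - my) * (b - my) for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    luminance = (2 * mx * my + c1) / (mx * mx + my * my + c1)
    structure = (2 * cov + c2) / (vx + vy + c2)
    return luminance * structure

# First eight EFD coefficients of the Clarans rows printed in Table 6.
CLARANS_TEST = [0.0015, 0.0029, 0.0043, 0.0058, 0.0073, 0.0088, 0.0102, 0.0117]
CLARANS_FEATURE = [0.0014, 0.0030, 0.0044, 0.0059, 0.0074, 0.0089, 0.0104, 0.0118]

if __name__ == "__main__":
    print(ssim(CLARANS_TEST, CLARANS_FEATURE))
```

The score is 1 when the two vectors are identical and decreases as their means, variances, or shapes diverge. This matches the intuition that a large gap between the Dbscan rows and the other rows in Table 6 should lower the similarity.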

Experiment 4

Dataset 4 (MFCCs) is used in this experiment. Partial data from MFCCs are shown in Table 7. Mel-frequency cepstral coefficients (MFCCs) are the coefficients that collectively make up a mel-frequency cepstrum (MFC). Since the syllables differ in length, every row i was normalized as MFCCs_i/max(abs(MFCCs_i)).

Table 7.

The partial data of MFCCs.

i	MFCCs_0	MFCCs_1	MFCCs_2	MFCCs_3	MFCCs_4	MFCCs_5	MFCCs_6
1	0.1529	–0.1055	0.2007	0.3172	0.2607	0.1009	–0.1500
2	0.1715	–0.0989	0.2684	0.3386	0.2683	0.0608	–0.2224
3	0.1523	–0.0829	0.2871	0.2760	0.1898	0.0087	–0.2422
4	0.2243	0.1189	0.3294	0.3720	0.3610	0.0155	–0.1943
5	0.0878	–0.0683	0.3069	0.3309	0.2491	0.0068	–0.2654
6	0.0997	–0.0334	0.3498	0.3445	0.2475	0.0224	–0.2137
7	0.0216	–0.0620	0.3182	0.3804	0.1793	–0.0416	–0.2523
8	0.1451	–0.0336	0.2841	0.2795	0.1752	0.0057	–0.1833
9	0.2713	0.0277	0.3757	0.3854	0.2724	0.0981	–0.1737
10	0.1205	–0.1072	0.3165	0.3644	0.3077	0.0259	–0.2941
11	–	–	–	–	–	–	–
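
The row normalization MFCCs_i/max(abs(MFCCs_i)) applied to this dataset can be sketched as follows. The function name `normalize_row`, the handling of an all-zero row, and the sample values are hypothetical and are not taken from the dataset.

```python
# Illustrative sketch: scale one row by its largest absolute value,
# so that every entry of the result lies in [-1, 1].

def normalize_row(row):
    """Return row / max(abs(row)); an all-zero row is returned unchanged (assumed convention)."""
    peak = max(abs(value) for value in row)
    if peak == 0:
        return list(row)
    return [value / peak for value in row]

if __name__ == "__main__":
    # Hypothetical raw coefficients for one syllable.
    print(normalize_row([2.0, -4.0, 1.0]))  # -> [0.5, -1.0, 0.25]
```

Dividing by the largest absolute value keeps the sign and relative shape of each row while removing the scale differences between syllables.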

As in experiment 1, each cluster is mapped to a 2D graphic, its edges are detected, and the EFD is calculated; these steps are not described again here. Following the true labels of MFCCs, the number of clusters is set to 3. Table 8 shows the EFD values of the mapped graphics of the MFCCs clusters, computed by equations (7) to (19). The EFD-SSIM values of the three clustering algorithms (Clarans, Kmeans, Dbscan), computed by equations (22) and (23), are 0.888282, 0.999963, and 0.180951, respectively.

Table 8.

EFD of mapped graphic of MFCCs.

Test sets T	Clarans	0.0009	0.0019	0.0028	0.0038	0.0047	0.0057	0.0066	0.0075	–
	Kmeans	0.0007	0.0014	0.0021	0.0029	0.0036	0.0043	0.0051	0.0057	–
	Dbscan	0.0014	0.0028	0.0042	0.0056	0.0070	0.0084	0.0098	0.0112	–
Feature sets F	Clarans	0.0011	0.0022	0.0033	0.0044	0.0055	0.0066	0.0077	0.0088	–
	Kmeans	0.0019	0.0039	0.0057	0.0077	0.0097	0.0116	0.0135	0.0155	–
	Dbscan	0.0014	0.0028	0.0042	0.0055	0.0071	0.0084	0.0080	0.0234	–

Discussion

Generally speaking, methods for evaluating the effectiveness of clustering fall into two categories: those based on within-cluster distance (e.g. the within.cluster.ss index, the sum of squared distances of the elements within every cluster) and those based on inter-cluster distance (e.g. the avg.silwidth index). A lower within.cluster.ss and a greater avg.silwidth indicate more effective clustering. In this section, the EFD-SSIM values from experiments 1 to 4 are compared with the avg.silwidth and within.cluster.ss values computed on the same data (see Tables 9 to 12). In all four tables, the three indices rank the algorithms in the same order: Kmeans has the lowest within.cluster.ss, the highest avg.silwidth, and the highest EFD-SSIM, followed by Clarans and then Dbscan. This agreement supports the effectiveness of EFD-SSIM. Since EFD-SSIM has only been verified on Kmeans, Clarans, and Dbscan, its adaptability to other clustering algorithms is a direction for our future research.
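
The two reference indices can be sketched in a few lines. The sketch assumes Euclidean distance, the usual centroid-based within-cluster sum of squares, and the standard silhouette width s(i) = (b - a)/max(a, b), where a is the mean distance from point i to the other members of its own cluster and b is the mean distance to the nearest other cluster. The function names and the toy points are hypothetical, and the values in Tables 9 to 12 were not produced by this code.

```python
# Illustrative sketch of within-cluster sum of squares and average silhouette width.
from math import dist


def _group(points, labels):
    """Collect the points of each cluster label into a list."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(tuple(point))
    return clusters


def within_cluster_ss(points, labels):
    """Sum, over all clusters, of squared distances from each member to its centroid."""
    total = 0.0
    for members in _group(points, labels).values():
        centroid = tuple(sum(col) / len(members) for col in zip(*members))
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total


def avg_silwidth(points, labels):
    """Mean silhouette width over all points; requires at least two clusters."""
    clusters = _group(points, labels)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # common convention for singleton clusters
            continue
        # The distance from the point to itself is 0, so divide by len(own) - 1.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(point, q) for q in other) / len(other)
                for other_label, other in clusters.items() if other_label != label)
        denom = max(a, b)
        widths.append((b - a) / denom if denom else 0.0)
    return sum(widths) / len(widths)


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 0), (10, 2)]  # hypothetical toy data
    lab = [0, 0, 1, 1]
    print(within_cluster_ss(pts, lab), avg_silwidth(pts, lab))
```

Compact, well-separated clusters give a small sum of squares and a silhouette width close to 1, which is the direction of "better" used when reading the tables that follow.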

Table 9.

Results of experiment 1.

Index	Clarans	Kmeans	Dbscan
within.cluster.ss	26.26355	25.70769	49.13795
avg.silwidth	0.719723	0.734621	0.706651
EFD-SSIM	0.816237	0.993007	0.674093

Table 10.

Results of experiment 2.

Index	Clarans	Kmeans	Dbscan
within.cluster.ss	883.8916	874.0414	918.5477
avg.silwidth	0.482124	0.492672	0.478369
EFD-SSIM	0.888282	0.999994	0.180958

Table 11.

Results of experiment 3.

Index	Clarans	Kmeans	Dbscan
within.cluster.ss	450.4787	434.0442	673.6689
avg.silwidth	0.370677	0.372512	0.153649
EFD-SSIM	0.999989	0.999993	0.994068

Table 12.

Results of experiment 4.

Index	Clarans	Kmeans	Dbscan
within.cluster.ss	454.388	432.9576	892.7699
avg.silwidth	0.379861	0.387172	0.26981
EFD-SSIM	0.888282	0.999963	0.180951
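
The agreement between the three indices can be checked directly from the tabulated values. The sketch below copies the numbers from Tables 9 to 12 and compares the best-to-worst ordering of the algorithms under each index. The helper names (`ranking`, `rankings_for`, `indices_agree`) are hypothetical.

```python
# Values copied from Tables 9 to 12; tuple order is (Clarans, Kmeans, Dbscan).
ALGORITHMS = ("Clarans", "Kmeans", "Dbscan")
RESULTS = {
    1: {"within.cluster.ss": (26.26355, 25.70769, 49.13795),
        "avg.silwidth": (0.719723, 0.734621, 0.706651),
        "EFD-SSIM": (0.816237, 0.993007, 0.674093)},
    2: {"within.cluster.ss": (883.8916, 874.0414, 918.5477),
        "avg.silwidth": (0.482124, 0.492672, 0.478369),
        "EFD-SSIM": (0.888282, 0.999994, 0.180958)},
    3: {"within.cluster.ss": (450.4787, 434.0442, 673.6689),
        "avg.silwidth": (0.370677, 0.372512, 0.153649),
        "EFD-SSIM": (0.999989, 0.999993, 0.994068)},
    4: {"within.cluster.ss": (454.388, 432.9576, 892.7699),
        "avg.silwidth": (0.379861, 0.387172, 0.26981),
        "EFD-SSIM": (0.888282, 0.999963, 0.180951)},
}
# within.cluster.ss is better when lower; the other two are better when higher.
HIGHER_IS_BETTER = {"within.cluster.ss": False, "avg.silwidth": True, "EFD-SSIM": True}


def ranking(values, higher_is_better):
    """Return the algorithm names ordered from best to worst for one index."""
    paired = sorted(zip(values, ALGORITHMS), reverse=higher_is_better)
    return [name for _, name in paired]


def rankings_for(experiment):
    """Best-to-worst ordering of the algorithms under each index of one experiment."""
    return {index: ranking(values, HIGHER_IS_BETTER[index])
            for index, values in RESULTS[experiment].items()}


def indices_agree(experiment):
    """True when all three indices give the same ordering."""
    orders = list(rankings_for(experiment).values())
    return all(order == orders[0] for order in orders)


if __name__ == "__main__":
    for experiment in RESULTS:
        print(experiment, rankings_for(experiment)["EFD-SSIM"], indices_agree(experiment))
```

With the tabulated values, every experiment yields the order Kmeans, Clarans, Dbscan under all three indices. Note that this comparison only checks rank agreement; it says nothing about how far apart the index values are.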

Conclusions

EFD-SSIM is a new index for evaluating clustering stability. Based on a graphical representation of the clusters, the stability of the clustering result is evaluated by the SSIM of their Fourier descriptors. The method quantitatively evaluates the stability of a clustering result from its visual graph alone, without supervision. The experimental results indicate that EFD-SSIM is effective.

Footnotes

Acknowledgements

We would like to thank Professor Yan Zhu Hu and Professor Xin Bo Ai for inspiring discussions about clustering algorithms, and Hui Yang and Zhen Meng for their assistance with the experiments. We also thank the anonymous reviewers for their valuable comments and suggestions, which improved the quality of the paper.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This paper is supported by the Science and Technology Plan Project of Beijing (Z181100000618006), the National Natural Science Foundation of China (61627816), and the Science and Technology Plan Project of Beijing (D161100004916002).

References

1. Steinhaus. Sur la division des corps matériels en parties. Bull Acad Polon Sci Cl III 1956; 4(12): 801–804.

2. Zhang, Ramakrishnan, Livny, et al. A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery 1997; 1(2): 182.

3. Marwala. Gaussian mixture models and hidden Markov models for condition monitoring. In: Condition Monitoring Using Computational Intelligence Methods. London: Springer, 2012, pp. 111–130.

4. Hinneburg, Keim. A general approach to clustering in large databases with noise. Knowl Inf Syst 2003; 5(4): 387–415.

5. Hmayer, Ezzine (eds). CLARANS heuristic based approach for the k-traveling repairman problem. In: Proceedings of the 2013 International conference on advanced logistics and transport, 29-31 May 2013.

6. Souto, Coelho ALV, Faceli, et al. (eds). A comparison of external clustering evaluation indices in the context of imbalanced data sets. In: Proceedings of the 2012 Brazilian Symposium on Neural Networks, 20-25 October 2012.

7. Xiong, Chen. K-means clustering versus validation measures: a data-distribution perspective. IEEE Trans Syst Man Cybernet Part B Cybernet 2009; 39: 318–331.

8. Crutzen, Giabbanelli, Jander, et al. Identifying binge drinkers based on parenting dimensions and alcohol-specific parenting practices: building classifiers on adolescent-parent paired data. BMC Public Health 2015; 15: 747.

9. Rand WM. Objective criteria for the evaluation of clustering methods. Public Am Stat Assoc 1971; 66: 846–850.

10. Mcdaid, Greene, Hurley. Normalized Mutual Information to evaluate overlapping community finding algorithms. Computer Science 2011; 22(3): 493–521.

11. Bezdek, Keller, Krisnapuram, et al. Fuzzy models and algorithms for pattern recognition and image processing (Vol. 4). Springer Science & Business Media, 1999.

12. Amigó, Gonzalo, Artiles, et al. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retrieval 2009; 12: 461–486.

13. He, Yu. Clustering stability-based evolutionary K-means. Soft Comput 2019; 23: 305–321.

14. Chou, Lai. A new cluster validity measure and its application to image compression. Pattern Anal Appl 2004; 7: 205–220.

15. Kapp, Robert. Are clusters found in one dataset present in another dataset? Biostatistics 2007; 8: 9–31.

16. Kaufman, Rousseeuw, Massart, et al. Least median of squares: a robust method for outlier and model error detection in regression and calibration. Anal Chim Acta 1986; 187: 171–179.

17. Dunn JC. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybernet 1973; 3: 32–57.

18. Nikolic, Tuba (eds). Edge detection in medical ultrasound images using adjusted Canny edge detection algorithm. In: Proceedings of the 2016 24th Telecommunications forum, 22-23 November 2016.

19. Hamd, Ahmed SK. Fourier descriptors for iris recognition. IJCDS J 2017; 6: 2210–2142.

20. Zhang (eds). Enhanced generic Fourier descriptors for object-based image retrieval. In: Proceedings of the 2002 IEEE international conference on acoustics, speech, and signal processing, 13-17 May 2002.

21. Huang, Yang (eds). Affine invariant ring Fourier descriptors. In: International conference on wavelet analysis and pattern recognition; 2013.

22. Belkhaoui, Toumi, Khalfallah. Fusion Fourier descriptors from the EM. International Journal of Computer & Information Technology 2013.

23. Kang SJ. SSIM preservation-based backlight dimming. J Display Technol 2017; 10: 247–250.

24. Wang, Rehman, Wang, et al. SSIM-motivated rate-distortion optimization for video coding. IEEE Trans Circuits Syst Video Technol 2012; 22: 516–529.

25. Aswathappa BHK, Rao (eds). Rate-distortion optimization using structural information in H.264 strictly Intra-frame encoder. In: System theory; 2010.

26. Mai, Yang, Xie (eds). Improved best prediction mode(s) selection methods based on structural similarity in H.264 I-frame encoder. In: Proceedings of the 2005 IEEE international conference on systems, man and cybernetics, 12 October 2005.

27. Bae, Kim (eds). A novel SSIM index for image quality assessment using a new luminance adaptation effect model in pixel intensity domain. In: Visual Commun Image Process (VCIP), 13-16 December 2015.

28. University CWR. Ball bearing test data. Case Western Reserve University, https://csegroups.case.edu/bearingdatacenter/pages/12k-drive-end-bearing-fault-data

29. Song, et al. The optimal design of industrial alarm systems based on evidence theory. Control Eng Pract 2016; 46: 142–156.

30. Keith, Jameson, Van Straten, et al. The high time resolution universe pulsar survey I: system configuration and initial discoveries. Monthly Notice R Astronom Soc 2010; 409: 619–627.

31. Colonna, Cristo, Júnior, et al. An incremental technique for real-time bioacoustic signal segmentation. Expert Syst Appl 2015; 42: 7367–7374.