Abstract
Gene expression data are characterized by high dimensionality and small sample size. The clustering accuracy of conventional clustering techniques is low on gene expression data because of this high dimensionality. Since some subspace segmentation approaches work well in high-dimensional spaces, three new subspace clustering models for gene expression data sets are proposed in this work. The proposed projection subspace clustering models are projection sparse subspace clustering, projection low-rank representation subspace clustering and projection least-squares regression subspace clustering, which combine the projection technique with sparse subspace clustering, low-rank representation and least-squares regression, respectively. To compute inner products in the high-dimensional space, a kernel function is applied in the projection subspace clustering models. Experimental results on six gene expression data sets show that these models are effective.
Introduction
In the past few decades, many pattern recognition methods have been applied to the study of gene expression data, such as the improved sparse representation,1 sparse nonnegative matrix factorization (NMF)2 and graph regularized-based subspace clustering.3 The biggest challenge in the study of gene expression data is its high dimensionality and small sample size: a gene expression data set usually has only tens to hundreds of samples, while each sample contains thousands or even tens of thousands of genes. Despite these difficulties, many classification and clustering methods are still used for gene expression data analysis. In this paper, we study the clustering of gene expression data.
Many traditional clustering methods, such as k-means,4 k-medoids,5 hierarchical clustering and the self-organizing map, achieve good results on low-dimensional data but not on high-dimensional data. For high-dimensional data, the usual approach is to first reduce the dimensionality, for example by principal component analysis (PCA),6 and then apply a clustering algorithm.7 However, PCA is a linear unsupervised dimension reduction method. Its goal is to learn an orthogonal transformation such that the transformed data have the largest possible variance, while ignoring, from a global point of view, the potential structure between categories.
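The conventional reduce-then-cluster pipeline described above can be sketched as follows. This is an illustrative example on synthetic stand-in data (random samples with hypothetical sizes), not the gene expression data used in this paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for gene expression data: 60 samples, 2000 genes.
X = rng.normal(size=(60, 2000))

# Step 1: reduce dimensionality with PCA (unsupervised, variance-maximizing).
X_low = PCA(n_components=10).fit_transform(X)   # shape (60, 10)

# Step 2: run a conventional clustering algorithm in the reduced space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_low)
print(X_low.shape, labels.shape)
```

As the text notes, PCA picks directions of maximal variance regardless of cluster structure, so this pipeline can project two well-separated classes onto overlapping regions.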
For example, Figure 1 shows data from two categories randomly generated from two-dimensional normal distributions, with 100 data points in each. As shown in Figure 1(a), the solid line A and the broken line B represent the first projection direction of PCA and the appropriate projection direction, respectively. Figure 1(b) shows the projected data, from which it is obvious that B is a better projection direction than A; the literature8 reached the same conclusion. When the number of subspaces (i.e. the number of clusters) is 1, a perfect low-dimensional representation can be obtained by PCA, but when the number of subspaces is larger than 1, PCA is not a good dimension reduction approach, because PCA cannot find a one-to-one correspondence between dimensions and subspaces in a high-dimensional space containing multiple subspaces.
Figure 1. Comparison of the first projection direction of PCA with the appropriate projection direction.
Inspired by subspace segmentation methods such as sparse subspace clustering (SSC),9 low-rank representation subspace clustering (LRR)10 and least-squares regression subspace clustering (LSR),11 which regard similar gene expression data as lying in the same subspace, we propose projection subspace clustering (PSC) to address the high-dimensional non-linear problem; it achieves dimension reduction and subspace segmentation simultaneously.
The rest of this paper is organized as follows. In the Subspace clustering section, we review related work on subspace clustering, namely SSC, LRR and LSR. The Projection subspace clustering section presents our PSC method. In the Experimental verification section, experiments on gene expression data clustering are conducted. Conclusions are drawn in the Conclusions section.
Subspace clustering
Subspace clustering is an important clustering method in machine learning and has been successfully applied in machine vision and other fields, such as clustering and image representation.9,11 The goal of subspace clustering is to segment sample data into several clusters such that each cluster corresponds to a subspace; it is therefore also known as subspace segmentation.9
Definition
Given a data set X = [x1, x2, …, xn] drawn from a union of k subspaces, the task of subspace clustering is to segment the samples according to their underlying subspaces.
In the past two decades, many subspace segmentation methods have been proposed. Existing works on subspace segmentation can be roughly divided into four categories: algebraic methods, statistical methods, iterative methods and spectral clustering-based methods. 11 A review of subspace segmentation can be found in the literature. 8
The important task of spectral clustering-based subspace segmentation methods is to find an affinity matrix that measures the similarity between samples.
Sparse subspace clustering
SSC is a sparse representation method. Because the original SSC optimization problem is non-convex and NP-hard, SSC usually solves the following L1-minimization problem instead
Low-rank representation subspace clustering
LRR subspace clustering is a low-rank representation method; the original LRR solves the following rank minimization problem
where rank(Z) is the rank of Z. The rank minimization problem is also NP-hard. In Liu et al.,10 the nuclear norm is used instead of rank(Z). Thus, LRR instead solves the following problem
where λ > 0.
Least-squares regression subspace clustering
LSR subspace clustering minimizes the Frobenius norm of Z, and its objective function is as follows
Its extended model for the noisy case is as follows
Removing the restrictive condition, the model can also be extended to
The subspace clustering algorithms described above have good clustering performance; in particular, the LSR method is robust and can quickly obtain the representation coefficients.11 However, for clustering gene expression data, the clustering results are easily affected by noise when the above three subspace clustering methods are applied directly to the high-dimensional data.
Projection subspace clustering
In high-dimensional space, existing subspace clustering methods such as SSC, LRR and LSR are not only easily affected by noise but also poorly suited to discovering the intrinsic geometric structure of the data set. We introduce the projection technique to extend SSC, LRR and LSR, and propose projection sparse subspace clustering (PSSC), projection low-rank representation subspace clustering (PLRR) and projection least-squares regression subspace clustering (PLSR).
The main idea of PSC is to map the data set X from the original space into a new space by learning a linear projection matrix P; the data set in the new space is Y = PX. To reduce the complexity of inner product computation in the high-dimensional space, the linear kernel function Kij = K(xi, xj) is introduced, and we let K = XTX. We discuss the three PSC models in turn as follows.
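The notation above can be made concrete in a few lines; the dimensions below are hypothetical, and the projection P is random here only for illustration (the PSC models learn it).

```python
import numpy as np

rng = np.random.default_rng(0)
D, n, d = 2000, 60, 10          # ambient dimension, samples, projection dimension
X = rng.normal(size=(D, n))     # columns are samples

# Projection matrix P maps the data into a d-dimensional space: Y = PX.
P = rng.normal(size=(d, D))     # random stand-in; PSC learns P
Y = P @ X                       # shape (d, n)

# Linear kernel matrix K = X^T X holds all pairwise inner products of samples,
# so the high-dimensional inner products never have to be recomputed.
K = X.T @ X                     # shape (n, n), symmetric
print(Y.shape, K.shape)
```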
Models of PSC
Based on the discussion in Subspace clustering section, the general subspace clustering model can be summarized as follows
The PSC model combines the projection technique with subspace clustering. To avoid the trivial solution, we add the constraint condition
For different choices of the norm on Z, different PSC models are obtained.
Projection sparse subspace clustering
The PSSC model combines the projection technique with SSC. To avoid the trivial solution, we add the constraint condition
where P is a linear projection matrix and λ > 0. Formulation (12) is our proposed PSSC model.
Projection low-rank representation
PLRR combines the projection technique with LRR. The original constraint term of LRR uses the L2,1-norm of the noise in formulation (6), but in practical applications the F-norm constraint is often more efficient than the L2,1-norm,14 so formulation (6) is modified as follows
Adding the projection technique to equation (13), the PLRR model is
Projection least-squares regression
PLSR combines the projection technique with LSR. To avoid the trivial solution, we add the constraint condition
where P is a projection matrix and λ > 0. Formulation (15) is our proposed PLSR model.
Optimization solution of PSSC model
The PSSC model has two variables P and Z, and their solutions can be obtained by an alternating optimization approach, which divides the problem into two steps: learning the projection matrix P while fixing the reconstruction coefficient matrix Z, and learning Z while fixing P.
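The alternating scheme can be sketched generically as below. The toy usage factorizes a rank-2 matrix by alternating least squares; it only illustrates the fix-one-update-the-other pattern, not the actual PSSC update rules.

```python
import numpy as np

def alternating_optimization(update_P, update_Z, P0, Z0, max_iter=50, tol=1e-6):
    """Generic alternation: fix Z and update P, then fix P and update Z."""
    P, Z = P0, Z0
    for _ in range(max_iter):
        P_new = update_P(Z)          # step 1: learn P with Z fixed
        Z_new = update_Z(P_new)      # step 2: learn Z with P fixed
        converged = (np.linalg.norm(Z_new - Z) < tol and
                     np.linalg.norm(P_new - P) < tol)
        P, Z = P_new, Z_new
        if converged:
            break
    return P, Z

# Toy usage: approximate an exactly rank-2 matrix A by the product P @ Z.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 8))
P, Z = alternating_optimization(lambda Z: A @ np.linalg.pinv(Z),
                                lambda P: np.linalg.pinv(P) @ A,
                                P0=rng.normal(size=(8, 2)),
                                Z0=rng.normal(size=(2, 8)))
print(np.linalg.norm(A - P @ Z))   # residual shrinks toward 0
```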
Learning P while fixing Z
When Z is fixed, the PSSC model reduces to problem (16).
The solution of equation (16) can be obtained by solving a generalized eigenvalue problem. However, the dimensionality of the matrix involved is as high as that of the original space, which makes the eigenvalue problem expensive to solve directly.
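As an illustration, a generalized eigenvalue problem of this kind can be solved with SciPy; the symmetric matrices A and B below are hypothetical stand-ins for the actual matrix pair, with B positive definite as `eigh` requires.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, d = 20, 5
M = rng.normal(size=(n, n))
A = M + M.T                      # symmetric stand-in matrix
B = np.eye(n) + 0.1 * (M @ M.T)  # symmetric positive definite stand-in

# Generalized eigenvalue problem A w = mu B w; eigh returns eigenvalues
# in ascending order, so the d largest are the last d columns.
mu, W = eigh(A, B)
W_top = W[:, -d:]                # eigenvectors of the d largest eigenvalues
print(W_top.shape)
```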
Let
The learning of P is thus converted to solving problem (18), and it is easy to obtain the equivalent representation of equation (18)
where
and let
Learning Z while fixing P
When P is fixed, the learning of Z in model (12) is equivalent to solving the following problem (22)

where Y = PX.
Because problem (22) is the same as the original SSC objective function (3), the original SSC algorithm can be used directly to obtain the solution Z* of problem (22).
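A minimal sketch of the sparse self-representation step at the heart of this problem follows. It solves the noisy SSC variant column by column with Lasso, which is a common simplification rather than the exact solver used in the paper; the subspace construction is a toy example.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_self_representation(Y, lam=0.01):
    """Represent each sample as a sparse combination of the other samples.

    Y: (d, n) data with samples as columns. Returns Z of shape (n, n)
    with zero diagonal, as in the SSC self-expressiveness model.
    """
    d, n = Y.shape
    Z = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]   # exclude the sample itself
        lasso = Lasso(alpha=lam, max_iter=5000)
        lasso.fit(Y[:, idx], Y[:, i])
        Z[idx, i] = lasso.coef_
    return Z

rng = np.random.default_rng(0)
# Two 2-dimensional subspaces of R^10, 15 samples each.
B1, B2 = rng.normal(size=(10, 2)), rng.normal(size=(10, 2))
Y = np.hstack([B1 @ rng.normal(size=(2, 15)), B2 @ rng.normal(size=(2, 15))])
Z = sparse_self_representation(Y)
print(Z.shape)
```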
Optimization solution of PLRR model
The method for solving P and Z in PLRR is similar to that of PSSC; the solutions of P and Z can be obtained by the alternating optimization approach.
Learning P while fixing Z
Similar to PSSC, the PLRR model (14) is equivalent to the above model (16) when Z is fixed.
Learning Z while fixing P
When P is fixed, in order to optimize Z, model (14) can be converted to
We first convert problem (23) to the following equivalent problem
To solve equation (24), we apply the augmented Lagrange multiplier method14 and obtain the following model
In fact, the solving process of equation (25) is similar to that of LRR, differing only in the updating formula for E. When using the F-norm, the updated E is
so we obtain the solution of equation (25) as follows
More details about the LRR algorithm can be found in Liu et al.;10 following it, we can obtain the solution of Z.
Optimization solution of PLSR model
Similar to PSSC, the PLSR model also has two variables P and Z, and their solutions can be obtained by the alternating optimization approach.
Learning P while fixing Z
The PLSR model (15) is equivalent to the above problem (16) when Z is fixed, so the solution of P can be obtained by solving the generalized eigenvalue problem (21), where the eigenvectors corresponding to the first d largest eigenvalues compose the matrix W and P* = WTXT.
Learning Z while fixing P
When P is fixed, the learning of Z is equivalent to solving the following problem
where Y = PX.
Setting the derivative of the objective function with respect to Z to zero yields the closed-form solution Z* = (YTY + λI)−1YTY.
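Assuming the standard LSR closed form Z* = (YTY + λI)−1YTY (the minimizer of ||Y − YZ||F² + λ||Z||F²), this step is a few lines of linear algebra; the data below are a random stand-in for the projected matrix Y = PX.

```python
import numpy as np

def lsr_solution(Y, lam):
    """Closed-form minimizer of ||Y - Y Z||_F^2 + lam * ||Z||_F^2."""
    n = Y.shape[1]
    G = Y.T @ Y                                   # Gram matrix of the samples
    # Solve (G + lam I) Z = G rather than forming the inverse explicitly.
    return np.linalg.solve(G + lam * np.eye(n), G)

rng = np.random.default_rng(0)
Y = rng.normal(size=(10, 30))                     # stand-in for Y = PX
Z = lsr_solution(Y, lam=0.01)
print(Z.shape)
```

Using `solve` instead of an explicit inverse is the usual numerically safer choice for this normal-equation form.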
PSC algorithm
Similar to SSC, LRR and LSR, PSC is also a spectral clustering-based method. We first solve the PSC problem (i.e. the PSSC, PLRR or PLSR problem) using the results in the Optimization solution of PSSC model, Optimization solution of PLRR model and Optimization solution of PLSR model sections to obtain the solution Z*.
Algorithm: projection subspace clustering
Input: data matrix X, number of subspaces k, regularization parameter λ, projection dimension d
Output: k clusters
Step 1: Solve the PSSC, PLRR or PLSR problem using the method in the corresponding optimization section to obtain the solution Z*:
Step 1.1: With Z fixed, optimize P by equation (21);
Step 1.2: With P fixed, optimize Z by equation (22), (25) or (29);
Until convergence;
Step 2: Calculate the affinity matrix from Z*;
Step 3: Apply normalized cuts to divide the data into k clusters.
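Steps 2 and 3 can be sketched as follows, using the common symmetrization (|Z| + |Zᵀ|)/2 as the affinity matrix and scikit-learn's spectral clustering as a stand-in for normalized cuts; the block-structured Z below is a toy stand-in for a learned representation matrix.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_representation(Z, k):
    """Steps 2-3: build an affinity matrix from Z, then spectral clustering."""
    A = (np.abs(Z) + np.abs(Z.T)) / 2            # symmetric affinity matrix
    sc = SpectralClustering(n_clusters=k, affinity='precomputed',
                            random_state=0)
    return sc.fit_predict(A)

# Toy representation matrix: two clear blocks of 5 samples each, plus a
# small background value to keep the affinity graph connected.
Z = 0.01 * np.ones((10, 10))
Z[:5, :5] = 1.0
Z[5:, 5:] = 1.0
labels = cluster_from_representation(Z, k=2)
print(labels)
```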
Experimental verification
To evaluate the effectiveness of the proposed PSC methods, we apply them to the gene expression data clustering problem. All experiments are carried out using Matlab 2010b on a PC with a 1.6 GHz CPU and 2 GB RAM.
Data sets
Summary of the gene expression data sets.
Experimental results and analysis
We compare the proposed PSSC, PLRR and PLSR approaches with the following methods: the subspace segmentation methods SSC,9 LRR10 and LSR,11 and the NMF-based methods convex NMF (CNMF) and semi-NMF (SNMF).24 The clustering results are evaluated by the clustering accuracy (ACC).25
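ACC matches predicted cluster labels to true class labels under the best permutation before counting agreements. A sketch using the Hungarian algorithm follows; this is the common definition of clustering accuracy, which we assume matches the measure of ref. 25.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true, pred):
    """ACC: fraction of correct labels under the best cluster-to-class matching."""
    true, pred = np.asarray(true), np.asarray(pred)
    k = max(true.max(), pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(true, pred):
        cost[p, t] += 1                          # contingency matrix
    row, col = linear_sum_assignment(-cost)      # maximize matched counts
    return cost[row, col].sum() / len(true)

# A clustering that merely swaps the label names is still 100% accurate.
print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```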
Clustering accuracy comparison (ACC% ± Var).
As can be seen from the experimental results shown in Table 2, we have the following observations:
The PSC methods outperform the other algorithms on five of the six data sets. In particular, the clustering accuracies of PSSC, PLRR and PLSR reach 100% on the Lymphoma data set, significantly higher than the other methods.
The clustering accuracies of the three PSC methods PSSC, PLRR and PLSR are higher than those of SSC, LRR and LSR, respectively.
Comparing the three PSC methods, PLSR is the best among PSSC, PLRR and PLSR; PLSR also gives the largest improvement over its base method LSR.
The clustering accuracies of the NMF-based methods CNMF and SNMF are not prominent, because they are not suitable for clustering high-dimensional gene expression data sets. A possible reason is that the solutions of CNMF and SNMF are only locally optimal.
From the findings mentioned above, we conclude that PLRR and PLSR are better methods for clustering gene expression data.
In terms of running time, the PSC methods perform dimension reduction and subspace clustering simultaneously, which increases the running time. Qiu and Sapiro12 noted that SSC is the slowest of the three traditional subspace clustering methods, followed by LRR, with LSR the fastest. Similarly, PSSC is the slowest of the three PSC methods, followed by PLRR, with PLSR the fastest. Moreover, PSSC is the slowest of all eight algorithms above.
Dimensionality reduction by PSC
The PSC method extends subspace clustering methods with a dimension-reducing projection technique. The effects of dimension reduction for the three PSC methods on the six data sets are shown in Figure 2.
Accuracy of three PSC methods on different dimensions.
From Figure 2, it is not difficult to see that the PSC method PLSR achieves the optimal clustering accuracy on most of the data sets, and the optimal clustering accuracy of PLSR mostly appears at projection dimensions of 10–20. When the dimension is 10–20, all three PSC methods achieve high clustering accuracy, which shows that the PSC method has a good capability of dimension reduction. However, if the dimension is much lower or higher, the clustering accuracy of the PSC methods decreases, so choosing an appropriate dimension is important.
PCA and projection space clustering algorithm
Clustering accuracy comparison on 30 dimensions.
Based on the experimental results shown in Tables 2 and 3, we can see that the clustering accuracy of both the traditional subspace clustering algorithms and the PSC algorithms is improved. In particular, the PSC algorithm PLSR has the highest clustering accuracy among all the methods in our study except on the 9_Tumors and BrainTumor data sets, and the clustering accuracies of the PSC algorithms PLRR and PLSR are greater than those of the classical subspace algorithms LRR and LSR, respectively, which is consistent with the results in Table 2.
Parameter selection
The PSC model has two main parameters: the projection dimension d and the regularization parameter λ. Overall, different parameter settings influence the clustering accuracy. Figure 2 describes the effect on clustering accuracy of changing the parameter d: better results are obtained when the dimension parameter d is 10–20, while the clustering accuracy is reduced if the projection dimension is too high or too low. A similar phenomenon appears in the experiments for selecting the regularization parameter λ. From Tables 1 and 2, the regularization parameter λ of PSSC and PLRR should be taken in 1–10, while that of PLSR should be taken in 0.0001–0.01.
Conclusions
In this paper, we propose PSC, which combines the projection technique with subspace clustering methods and is applied to gene expression data with high dimensionality and small sample size. The experimental results show that the three PSC methods PSSC, PLRR and PLSR are more suitable for clustering gene expression data than the traditional subspace clustering methods SSC, LRR and LSR. However, how to efficiently select the projection dimension and the regularization parameter λ is a problem deserving future research.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by National Natural Science Foundation of China (grant no. 71273053 and 11571074) and Natural Science Foundation of Fujian Province (Grant no. 2014J01009).
