Abstract
Gene expression data are characterized by high dimensionality and small sample size. The clustering accuracy of conventional clustering techniques is low on gene expression data because of this high dimensionality. Since some subspace segmentation approaches work well in high-dimensional spaces, three new subspace clustering models for gene expression data sets are proposed in this work. The proposed projection subspace clustering models are projection sparse subspace clustering, projection low-rank representation subspace clustering and projection least-squares regression subspace clustering, which combine the projection technique with sparse subspace clustering, low-rank representation and least-squares regression, respectively. To compute inner products in the high-dimensional space, a kernel function is applied in the projection subspace clustering models. Experimental results on six gene expression data sets show that these models are effective.
Introduction
In the past few decades, many pattern recognition methods have been applied to the study of gene expression data, such as the improved sparse representation,1 sparse nonnegative matrix factorization (NMF)2 and graph regularized-based subspace clustering.3 The biggest challenge in the study of gene expression data is its high dimensionality and small sample size: a gene expression data set usually has only tens to hundreds of samples, while each sample contains thousands or even tens of thousands of genes. Despite these difficulties, many classification and clustering methods are still used for gene expression data analysis. In this paper, we study the clustering of gene expression data.
Many traditional clustering methods, such as k-means,4 k-medoids,5 hierarchical clustering and the self-organizing map, achieve good results on low-dimensional data but not on high-dimensional data. For high-dimensional data, the usual approach is to first reduce the dimensionality, for example by principal component analysis (PCA),6 and then apply a clustering algorithm.7 However, PCA is a linear unsupervised dimension reduction method. Its goal is to learn an orthogonal transformation such that the transformed data have the largest possible variance, while ignoring, from a global point of view, the potential structure between categories.
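The conventional reduce-then-cluster pipeline described above can be sketched as follows. This is an illustrative example on synthetic stand-in data (random samples with hypothetical sizes), not the gene expression data used in this paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for gene expression data: 60 samples, 2000 genes.
X = rng.normal(size=(60, 2000))

# Step 1: reduce dimensionality with PCA (unsupervised, variance-maximizing).
X_low = PCA(n_components=10).fit_transform(X)   # shape (60, 10)

# Step 2: run a conventional clustering algorithm in the reduced space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_low)
print(X_low.shape, labels.shape)
```

As the text notes, PCA picks directions of maximal variance regardless of cluster structure, so this pipeline can project two well-separated classes onto overlapping regions.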
For example, Figure 1 shows data from two categories randomly generated from two-dimensional normal distributions, with 100 data points in each. As shown in Figure 1(a), the solid line A and the broken line B represent the first projection direction of PCA and the appropriate projection direction, respectively. Figure 1(b) shows the projected data, from which it is obvious that B is a better projection direction than A; the literature8 reached the same conclusion. When the number of subspaces (i.e. the number of clusters) is 1, a perfect low-dimensional representation can be obtained by PCA, but when the number of subspaces is larger than 1, PCA is not a good dimension reduction approach, because PCA cannot find a one-to-one correspondence between dimensions and subspaces in a high-dimensional space containing multiple subspaces.
Figure 1. Comparison of the first projection direction of PCA with the appropriate projection direction.
Inspired by subspace segmentation methods such as sparse subspace clustering (SSC),9 low-rank representation subspace clustering (LRR)10 and least-squares regression subspace clustering (LSR),11 which regard similar gene expression data as lying in the same subspace, we propose projection subspace clustering (PSC) to address the high-dimensional non-linear problem; it achieves dimension reduction and subspace segmentation simultaneously.
The rest of this paper is organized as follows. In the Subspace clustering section, we review related work on subspace clustering, namely SSC, LRR and LSR. The Projection subspace clustering section presents our PSC method. In the Experimental verification section, experiments on gene expression data clustering are conducted. Conclusions are drawn in the Conclusions section.
Subspace clustering
Subspace clustering is an important clustering method in machine learning and has been successfully applied in machine vision and other fields, such as clustering and image representation.9,11 The goal of subspace clustering is to segment sample data into several clusters such that each cluster corresponds to a subspace; it is therefore also known as subspace segmentation.9
Definition
Given a data set X = [x1, x2, …, xn] drawn from a union of k subspaces, the task of subspace clustering is to segment the samples according to their underlying subspaces.
In the past two decades, many subspace segmentation methods have been proposed. Existing works on subspace segmentation can be roughly divided into four categories: algebraic methods, statistical methods, iterative methods and spectral clustering-based methods. 11 A review of subspace segmentation can be found in the literature. 8
The important task of spectral clustering-based subspace segmentation methods is to find an affinity matrix that measures the similarity between samples.
Sparse subspace clustering
SSC is a sparse representation method. Because the original SSC optimization problem is non-convex and NP-hard, SSC usually solves the following L1-minimization problem instead
Low-rank representation subspace clustering
LRR subspace clustering is a low-rank representation method; the original LRR solves the following rank minimization problem
where rank(Z) is the rank of Z. The rank minimization problem is also NP-hard. In Liu et al.,10 the nuclear norm is used instead of rank(Z). Thus, LRR instead solves the following problem
where λ > 0.
Least-squares regression subspace clustering
LSR subspace clustering minimizes the Frobenius norm of Z, and its objective function is as follows
Its extended model for the noisy case is as follows
Removing the restrictive condition, the model can also be extended to
The subspace clustering algorithms described above have good clustering performance; in particular, the LSR method is robust and can quickly obtain the representation coefficients.11 However, for clustering gene expression data, the clustering results are easily affected by noise when the above three subspace clustering methods are applied directly to the high-dimensional data.
Projection subspace clustering
In high-dimensional space, existing subspace clustering methods such as SSC, LRR and LSR are not only easily affected by noise but also poorly suited to discovering the intrinsic geometric structure of the data set. We introduce the projection technique to extend SSC, LRR and LSR, and propose projection sparse subspace clustering (PSSC), projection low-rank representation subspace clustering (PLRR) and projection least-squares regression subspace clustering (PLSR).
The main idea of PSC is to map the data set X from the original space into a new space by learning a linear projection matrix P; the data set in the new space is Y = PX. To reduce the complexity of inner product computation in the high-dimensional space, the linear kernel function Kij = K(xi, xj) is introduced, and we let K = XTX. We discuss the three PSC models in turn as follows.
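The notation above can be made concrete in a few lines; the dimensions below are hypothetical, and the projection P is random here only for illustration (the PSC models learn it).

```python
import numpy as np

rng = np.random.default_rng(0)
D, n, d = 2000, 60, 10          # ambient dimension, samples, projection dimension
X = rng.normal(size=(D, n))     # columns are samples

# Projection matrix P maps the data into a d-dimensional space: Y = PX.
P = rng.normal(size=(d, D))     # random stand-in; PSC learns P
Y = P @ X                       # shape (d, n)

# Linear kernel matrix K = X^T X holds all pairwise inner products of samples,
# so the high-dimensional inner products never have to be recomputed.
K = X.T @ X                     # shape (n, n), symmetric
print(Y.shape, K.shape)
```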
Models of PSC
Based on the discussion in Subspace clustering section, the general subspace clustering model can be summarized as follows
The PSC model combines the projection technique with subspace clustering. To avoid the trivial solution, we add the constraint condition
For different choices of the norm on Z, different PSC models are obtained.
Projection sparse subspace clustering
The PSSC model combines the projection technique with SSC. To avoid the trivial solution, we add the constraint condition
where P is a linear projection matrix and λ > 0. Formulation (12) is our proposed PSSC model.
Projection low-rank representation
PLRR combines the projection technique with LRR. The original constraint term of LRR uses the L2,1-norm of the noise in formulation (6), but in practical applications the F-norm constraint is often more efficient than the L2,1-norm,14 so formulation (6) is modified as follows
Adding the projection technique to equation (13), the PLRR model is
Projection least-squares regression
PLSR combines the projection technique with LSR. To avoid the trivial solution, we add the constraint condition
where P is a projection matrix and λ > 0. Formulation (15) is our proposed PLSR model.
Optimization solution of PSSC model
The PSSC model has two variables P and Z, and their solutions can be obtained by an alternating optimization approach, which divides the problem into two steps: learning the projection matrix P while fixing the reconstruction coefficient matrix Z, and learning Z while fixing P.
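The alternating scheme can be sketched generically as below. The toy usage factorizes a rank-2 matrix by alternating least squares; it only illustrates the fix-one-update-the-other pattern, not the actual PSSC update rules.

```python
import numpy as np

def alternating_optimization(update_P, update_Z, P0, Z0, max_iter=50, tol=1e-6):
    """Generic alternation: fix Z and update P, then fix P and update Z."""
    P, Z = P0, Z0
    for _ in range(max_iter):
        P_new = update_P(Z)          # step 1: learn P with Z fixed
        Z_new = update_Z(P_new)      # step 2: learn Z with P fixed
        converged = (np.linalg.norm(Z_new - Z) < tol and
                     np.linalg.norm(P_new - P) < tol)
        P, Z = P_new, Z_new
        if converged:
            break
    return P, Z

# Toy usage: approximate an exactly rank-2 matrix A by the product P @ Z.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 8))
P, Z = alternating_optimization(lambda Z: A @ np.linalg.pinv(Z),
                                lambda P: np.linalg.pinv(P) @ A,
                                P0=rng.normal(size=(8, 2)),
                                Z0=rng.normal(size=(2, 8)))
print(np.linalg.norm(A - P @ Z))   # residual shrinks toward 0
```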
Learning P while fixing Z
When Z is fixed, the PSSC model reduces to problem (16).
The solution of equation (16) can be obtained by solving a generalized eigenvalue problem. However, the dimensionality of the matrix involved is as high as that of the original space, which makes the eigenvalue problem expensive to solve directly.
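As an illustration, a generalized eigenvalue problem of this kind can be solved with SciPy; the symmetric matrices A and B below are hypothetical stand-ins for the actual matrix pair, with B positive definite as `eigh` requires.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, d = 20, 5
M = rng.normal(size=(n, n))
A = M + M.T                      # symmetric stand-in matrix
B = np.eye(n) + 0.1 * (M @ M.T)  # symmetric positive definite stand-in

# Generalized eigenvalue problem A w = mu B w; eigh returns eigenvalues
# in ascending order, so the d largest are the last d columns.
mu, W = eigh(A, B)
W_top = W[:, -d:]                # eigenvectors of the d largest eigenvalues
print(W_top.shape)
```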
Let
The learning of P is thus converted to solving problem (18), and it is easy to obtain the equivalent representation of equation (18)
where
and let
Learning Z while fixing P
When P is fixed, the learning of Z in model (12) is equivalent to solving the following problem (22)

where Y = PX.
Because problem (22) is the same as the original SSC objective function (3), the original SSC algorithm can be used directly to obtain the solution Z* of problem (22).
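A minimal sketch of the sparse self-representation step at the heart of this problem follows. It solves the noisy SSC variant column by column with Lasso, which is a common simplification rather than the exact solver used in the paper; the subspace construction is a toy example.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_self_representation(Y, lam=0.01):
    """Represent each sample as a sparse combination of the other samples.

    Y: (d, n) data with samples as columns. Returns Z of shape (n, n)
    with zero diagonal, as in the SSC self-expressiveness model.
    """
    d, n = Y.shape
    Z = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]   # exclude the sample itself
        lasso = Lasso(alpha=lam, max_iter=5000)
        lasso.fit(Y[:, idx], Y[:, i])
        Z[idx, i] = lasso.coef_
    return Z

rng = np.random.default_rng(0)
# Two 2-dimensional subspaces of R^10, 15 samples each.
B1, B2 = rng.normal(size=(10, 2)), rng.normal(size=(10, 2))
Y = np.hstack([B1 @ rng.normal(size=(2, 15)), B2 @ rng.normal(size=(2, 15))])
Z = sparse_self_representation(Y)
print(Z.shape)
```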
Optimization solution of PLRR model
The method for solving P and Z in PLRR is similar to that of PSSC; the solutions of P and Z can be obtained by the alternating optimization approach.
Learning P while fixing Z
Similar to PSSC, the PLRR model (14) is equivalent to the above model (16) when Z is fixed.
Learning Z while fixing P
When P is fixed, in order to optimize Z, model (14) can be converted to
We first convert problem (23) to the following equivalent problem
To solve equation (24), we apply the augmented Lagrange multiplier method14 and obtain the following model
In fact, the solving process of equation (25) is similar to that of LRR, differing only in the updating formula for E. When using the F-norm, the updated E is
so we obtain the solution of equation (25) as follows
More details about the LRR algorithm can be found in Liu et al.;10 following it, we can obtain the solution of Z.
Optimization solution of PLSR model
Similar to PSSC, the PLSR model also has two variables P and Z, and their solutions can be obtained by the alternating optimization approach.
Learning P while fixing Z
The PLSR model (15) is equivalent to the above problem (16) when Z is fixed, so the solution of P can be obtained by solving the generalized eigenvalue problem (21), where the eigenvectors corresponding to the first d largest eigenvalues compose the matrix W and P* = WTXT.
Learning Z while fixing P
When P is fixed, the learning of Z is equivalent to solving the following problem
where Y = PX.
Setting the derivative of the objective function with respect to Z to zero yields the closed-form solution Z* = (YTY + λI)−1YTY.
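Assuming the standard LSR closed form Z* = (YTY + λI)−1YTY (the minimizer of ||Y − YZ||F² + λ||Z||F²), this step is a few lines of linear algebra; the data below are a random stand-in for the projected matrix Y = PX.

```python
import numpy as np

def lsr_solution(Y, lam):
    """Closed-form minimizer of ||Y - Y Z||_F^2 + lam * ||Z||_F^2."""
    n = Y.shape[1]
    G = Y.T @ Y                                   # Gram matrix of the samples
    # Solve (G + lam I) Z = G rather than forming the inverse explicitly.
    return np.linalg.solve(G + lam * np.eye(n), G)

rng = np.random.default_rng(0)
Y = rng.normal(size=(10, 30))                     # stand-in for Y = PX
Z = lsr_solution(Y, lam=0.01)
print(Z.shape)
```

Using `solve` instead of an explicit inverse is the usual numerically safer choice for this normal-equation form.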
PSC algorithm
Similar to SSC, LRR and LSR, PSC is also a spectral clustering-based method. We first solve the PSC problem (i.e. the PSSC, PLRR or PLSR problem) using the results in the Optimization solution of PSSC model, Optimization solution of PLRR model and Optimization solution of PLSR model sections to obtain the solution Z*.
Algorithm: projection subspace clustering
Input: data matrix X, number of subspaces k, regularization parameter λ, projection dimension d
Output: k clusters
Step 1: Solve the PSSC, PLRR or PLSR problem using the method in the corresponding optimization section to obtain the solution Z*:
Step 1.1: With Z fixed, optimize P by equation (21);
Step 1.2: With P fixed, optimize Z by equation (22), (25) or (29);
Until convergence;
Step 2: Calculate the affinity matrix from Z*;
Step 3: Apply normalized cuts to divide the data into k clusters.
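Steps 2 and 3 can be sketched as follows, using the common symmetrization (|Z| + |Zᵀ|)/2 as the affinity matrix and scikit-learn's spectral clustering as a stand-in for normalized cuts; the block-structured Z below is a toy stand-in for a learned representation matrix.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_representation(Z, k):
    """Steps 2-3: build an affinity matrix from Z, then spectral clustering."""
    A = (np.abs(Z) + np.abs(Z.T)) / 2            # symmetric affinity matrix
    sc = SpectralClustering(n_clusters=k, affinity='precomputed',
                            random_state=0)
    return sc.fit_predict(A)

# Toy representation matrix: two clear blocks of 5 samples each, plus a
# small background value to keep the affinity graph connected.
Z = 0.01 * np.ones((10, 10))
Z[:5, :5] = 1.0
Z[5:, 5:] = 1.0
labels = cluster_from_representation(Z, k=2)
print(labels)
```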
Experimental verification
To evaluate the effectiveness of the proposed PSC methods, we apply them to the gene expression data clustering problem. All experiments are carried out using Matlab 2010b on a PC with a 1.6 GHz CPU and 2 GB RAM.
Data sets
Summary of the gene expression data sets.
Experimental results and analysis
We compare the proposed PSSC, PLRR and PLSR approaches with the following methods: the subspace segmentation methods SSC,9 LRR10 and LSR,11 and the NMF-based methods convex NMF (CNMF) and semi-NMF (SNMF).24 The clustering results are evaluated by the clustering accuracy (ACC).25
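ACC matches predicted cluster labels to true class labels under the best permutation before counting agreements. A sketch using the Hungarian algorithm follows; this is the common definition of clustering accuracy, which we assume matches the measure of ref. 25.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true, pred):
    """ACC: fraction of correct labels under the best cluster-to-class matching."""
    true, pred = np.asarray(true), np.asarray(pred)
    k = max(true.max(), pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(true, pred):
        cost[p, t] += 1                          # contingency matrix
    row, col = linear_sum_assignment(-cost)      # maximize matched counts
    return cost[row, col].sum() / len(true)

# A clustering that merely swaps the label names is still 100% accurate.
print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```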
Clustering accuracy comparison (ACC% ± Var).
As can be seen from the experimental results shown in Table 2, we have the following observations:
The PSC methods outperform the other algorithms on five of the six data sets. In particular, the clustering accuracies of PSSC, PLRR and PLSR reach 100% on the Lymphoma data set, significantly higher than the other methods.
The clustering accuracies of the three PSC methods PSSC, PLRR and PLSR are higher than those of SSC, LRR and LSR, respectively.
Comparing the three PSC methods, PLSR is the best among PSSC, PLRR and PLSR; PLSR also gives the largest improvement over its base method LSR.
The clustering accuracies of the NMF-based methods CNMF and SNMF are not prominent, because they are not suitable for clustering high-dimensional gene expression data sets. A possible reason is that the solutions of CNMF and SNMF are only locally optimal.
From the findings mentioned above, we conclude that PLRR and PLSR are better methods for clustering gene expression data.
In terms of running time, the PSC methods perform dimension reduction and subspace clustering simultaneously, which increases the running time. Qiu and Sapiro12 noted that SSC is the slowest of the three traditional subspace clustering methods, followed by LRR, with LSR the fastest. Similarly, PSSC is the slowest of the three PSC methods, followed by PLRR, with PLSR the fastest. Moreover, PSSC is the slowest of all eight algorithms above.
Dimensionality reduction by PSC
The PSC method extends subspace clustering methods with a dimension-reducing projection technique. The effects of dimension reduction for the three PSC methods on the six data sets are shown in Figure 2.
Accuracy of three PSC methods on different dimensions.
From Figure 2, it is not difficult to see that the PSC method PLSR achieves the optimal clustering accuracy on most of the data sets, and the optimal clustering accuracy of PLSR mostly appears at projection dimensions of 10–20. When the dimension is 10–20, all three PSC methods achieve high clustering accuracy, which shows that the PSC method has a good capability of dimension reduction. However, if the dimension is much lower or higher, the clustering accuracy of the PSC methods decreases, so choosing an appropriate dimension is important.
PCA and projection space clustering algorithm
Clustering accuracy comparison on 30 dimensions.
Based on the experimental results shown in Tables 2 and 3, we can see that the clustering accuracy of both the traditional subspace clustering algorithms and the PSC algorithms is improved. In particular, the PSC algorithm PLSR has the highest clustering accuracy among all the methods in our study except on the 9_Tumors and BrainTumor data sets, and the clustering accuracies of the PSC algorithms PLRR and PLSR are greater than those of the classical subspace algorithms LRR and LSR, respectively, which is consistent with the results in Table 2.
Parameter selection
The PSC model has two main parameters: the projection dimension d and the regularization parameter λ. Overall, different parameter settings influence the clustering accuracy. Figure 2 describes the effect on clustering accuracy of changing the parameter d: better results are obtained when the dimension parameter d is 10–20, while the clustering accuracy is reduced if the projection dimension is too high or too low. A similar phenomenon appears in the experiments for selecting the regularization parameter λ. From Tables 1 and 2, the regularization parameter λ of PSSC and PLRR should be taken in 1–10, while that of PLSR should be taken in 0.0001–0.01.
Conclusions
In this paper, we propose PSC, which combines the projection technique with subspace clustering methods and is applied to gene expression data with high dimensionality and small sample size. The experimental results show that the three PSC methods PSSC, PLRR and PLSR are more suitable for clustering gene expression data than the traditional subspace clustering methods SSC, LRR and LSR. However, how to efficiently select the projection dimension and the regularization parameter λ is a problem deserving future research.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by National Natural Science Foundation of China (grant no. 71273053 and 11571074) and Natural Science Foundation of Fujian Province (Grant no. 2014J01009).
