Abstract
Identifying clinically relevant subtypes of a cancer using gene expression data is a challenging and important problem in medicine, and is a necessary premise to provide specific and efficient treatments for patients of different subtypes. Matrix factorization provides a solution by finding checkerboard patterns in the matrices of gene expression data. In the context of gene expression profiles of cancer patients, these checkerboard patterns correspond to genes that are up- or down-regulated in patients with particular cancer subtypes. Recently, a new matrix factorization framework for biclustering called Maximum Block Improvement (MBI) is proposed; however, it still suffers several problems when applied to cancer gene expression data analysis. In this study, we developed many effective strategies to improve MBI and designed a new program called enhanced MBI (eMBI), which is more effective and efficient to identify cancer subtypes. Our tests on several gene expression profiling datasets of cancer patients consistently indicate that eMBI achieves significant improvements in comparison with MBI, in terms of cancer subtype prediction accuracy, robustness, and running time. In addition, the performance of eMBI is much better than another widely used matrix factorization method called nonnegative matrix factorization (NMF) and the method of hierarchical clustering, which is often the first choice of clinical analysts in practice.
Keywords
Introduction
Microarray and RNA-seq technologies produce huge amount of gene expression datasets from which we could discover meaningful information for biological processes and many diseases. Gene expression data can be arranged in a matrix whose rows correspond to genes and columns correspond to different conditions (eg, different patients in the context of gene expression profiling of cancer patients). The values in this kind of matrix could represent either normalized gene expression levels (such as Affymetrix GeneChips) or relative gene expression ratios (such as cDNA microarrays). 1
Various methods have been developed for clustering genes or conditions that show similar expression patterns.2–5 Traditional clustering technologies only focus on one dimension, and partition either genes (Fig. 1A) or conditions (Fig. 1B) into different groups based on their similarities. Although useful, traditional clustering methods have limitations compared with biclustering methods in their ability to discover similarity of subgroups based on subsets of attributes. Cheng and Church
6
first introduced the concept of biclustering, which extends the traditional clustering technologies by simultaneously clustering both genes and conditions (Fig. 1C and D). Thus, some coexpressed genes under some conditions, corresponding to the sub-matrices of the raw matrix (called

Shuffled matrices obtained from different clustering methods, including (A) gene clustering, (B) condition clustering, (C) biclustering that generates overlapping biclusters, and (D) biclustering that reports checkerboard structure.
For the current molecular study of different cancers with large amount of datasets from different platforms, one critical computational challenge is to conduct unsupervised clustering analysis. Especially gene expression clustering is a key step in performing a cancer molecular study such as cancer class discovery class prediction, molecular subtyping, and identification of gene expression-based prognostic signatures. Every molecular subtyping study, eg, the study of different cancer subtype of the effort of The Cancer Genome Atlas (TCGA), involves application of a specifically selected clustering approach. Identifying clinically relevant subtypes of a cancer based on gene expression data has many important applications in medicine, and is a necessary premise to provide specific and efficient treatments for patients of different subtypes.12–15 However, most of the existing biclustering methods do not work well in predicting those subtypes from a cancer data because of the following reasons: (i) these methods often iteratively search for the biclusters one by one, and different biclusters may have overlaps (Fig. 1C), which results in an unreasonable fact that one patient could be classified into two different subtypes. (ii) Even though their reported clusters are non-overlapping (some methods have a specific parameter for this requirement), the methods still do not work well because they focus on finding local optimal clusters, instead of finding a global partition of columns. Thus, they usually report an unmeaningful classification such that some patients are not classified into any group.
Matrix factorization provides a solution for this outstanding problem by finding checkerboard patterns in the matrices of gene expression data (Fig. 1D). In cancer gene expression data analysis, these checkerboards correspond to genes that are obviously up- or down-regulated in patients with particular subtypes of tumors. Recently, a new matrix factorization framework for biclustering called maximum block improvement (MBI) 16 is proposed, but several problems exist and hinder its practical application in the cancer context.
In this study, we proposed an enhanced MBI (eMBI) method, which is more suitable to the problem of detecting different subtypes of cancer. Test results on several cancer datasets consistently indicate that eMBI has significant improvements in comparison with MBI, in terms of subtype prediction accuracy, robustness, and running time. eMBI is also demonstrated to have significantly better prediction accuracy than hierarchical clustering (HC) and another matrix factorization method called nonnegative matrix factorization (NMF), 17 and has important additional abilities, such as identifying potential marker genes.
Methods
We first revisit the framework of MBI and point out the problems that would affect its performance and hinder its practical application, particularly in the context of identifying subtypes of a cancer. Then, we propose solutions to each of those problems and verify their effectiveness on a benchmark dataset. Finally, we give an eMBI method, which is designed specifically for cancer subtype prediction.
Framework of MBI
The MBI method is proposed as a generic algorithm using the concept of tensor in the original paper.
16
The MBI method is based on a tensor optimization model. Consider the following formulation for the co-clustering problem for a given tensor dataset
is a row assignment matrix,
where
Mbi Method Described in Terms of a Matrix
Let

An example of MBI, (A) the raw matrix
We first define a
Next, an
Based on the three matrices defined above, we could get a matrix factorization of
The input of the algorithm is a 2D matrix

The pseudo code of the algorithm MBI.
Problems of applying MBI in practice
In this study, we always focus on the problem of identifying different subtypes of a cancer, which is an important application of gene expression data analysis in medicine. MBI addresses this problem by finding checkerboard patterns of the gene expression matrix, but the following problems exist and hinder its power in cancer gene expression data clustering.
The first problem is that the difference of the sizes of
The second problem is that both
The third problem is that MBI cannot even guarantee a stable solution; that is, it may report very different results in different runs. To address this problem, a consensus clustering strategy will be introduced to improve the robustness of MBI.
Solutions to the problems of MBI
The solutions to each of the problems mentioned above are described in detail as follows.
Gene filtering
In practice, clinical analysts usually select genes based on their experiences, instead of using all genes for cancer gene expression data analyses, such as cancer subtype prediction.1,18 This step can be done in a computational way, without any knowledge of the genes. With the reduction of the number of genes, the first problem would be solved. Intuitively, the selected genes should have the property of best partitioning the patients into distinct classes, which could be measured by the variances of the gene expression across all patients. One gene with little variance, that is, which has the same or similar expression across all patients, can provide no information for classification. Only genes with large variance can potentially mark different subtypes of cancer and provide possibility to cluster different patients. Therefore, we can only choose the top
Better initialization
Iterative algorithms, including MBI, usually cannot guarantee to get a global optimal solution, so how to select an initial value can significantly affect the final result. Instead of initializing randomly, we gave a relative better initialization for
Consensus clustering
For the third problem of MBI without guarantee to get a stable solution, we use a strategy called consensus clustering, which is first proposed by Monti et al.
19
and used in many other studies.20–22 The basic idea of consensus clustering is that one can discover clusters based on the consensus over multiple runs of a clustering algorithm with random restart. But, when the matrix is very huge, multiple runs of an algorithm become impossible, so people usually employ a subsampling technology such that each run begins with different subsamples of the original matrix, instead of the whole matrix. Since our gene-filtering procedure could greatly reduce the running time, such a subsampling step is not needed any more. A consensus matrix is defined, with the (
Framework of eMBI
Combining all the improvements mentioned above, we propose an eMBI method, which is designed specifically for cancer subtype prediction. The pseudo code of eMBI is shown in Figure 4.

The pseudo code of the algorithm eMBI.
Results
We tested our new method, eMBI, on three publicly available datasets, a summary of which is given below.
Data 1: A lung cancer dataset from Ref. 23, 56 samples belonging to four groups: normal subjects (Normal), pulmonary carcinoid tumors (Carcinoid), colon metastases (Colon), and Small cell carcinoma (SmallCell).
Data 2: A colorectal cancer (CRC) dataset from Ref. 24, 62 samples belonging to two dominant CRC subtypes.
Data 3: A non-small-cell lung cancer dataset from Ref. 25, two subtypes of samples: 40 adenocarcinoma (AC) samples and 18 squamous cell carcinoma (SCC) samples.
Interested readers could find more detailed information related to these datasets from the corresponding reference papers and these datasets are downloaded from the websites of the reference papers.
First, the effectiveness of each solution mentioned above is verified on a well-studied benchmark dataset (Data 1). Then, we further evaluate the overall outperformance of eMBI using a CRC dataset (Data 2) in a recent study. Finally, a non-small-cell lung cancer dataset (Data 3) is used to compare eMBI with two other methods: one is a matrix factorization method called NMF 17 and the other one is HC, which is often the first choice of clinical analyst in practice.
For the following testing results, when the sample grouping or sample subtype information is available,
Effectiveness of each solution
We verify the effectiveness of our solutions to the problems of MBI on a benchmark dataset (Data 1), which consists of 12,625 genes and 56 subjects 23 belonging to four groups: normal subjects (Normal), pulmonary carcinoid tumors (Carcinoid), colon metastases (Colon), and small cell carcinoma (SmallCell).
We chose the top 20% genes with biggest variances, and checked the effectiveness of this gene-filtering procedure on Data 1. The testing results based on the average performances of 10 runs are shown in Figure 5. It is obvious that our gene-filtering procedure can dramatically reduce the running time (Fig. 5B), and to our surprise, can even greatly improve the accuracy by ~8% (Fig. 5A).

(A) The gene filtering procedure can greatly improve the accuracy, and (B) The gene filtering procedure can dramatically reduce the running time. Effectiveness of our gene-filtering procedure.
Our initialization strategy also works very well in comparison with random initialization (Table 1). In fact, a new initialization using the
Comparing the performances of different initialization methods.
We also tested the benefits of consensus clustering on Data 1. As we expected, MBI exhibits distinct accuracy in different runs, but consensus clustering shows much more robustness and has higher accuracy (Fig. 6).

Consensus clustering is robust among 10 runs, while MBI is not.
Overall outperformance of eMBI
We further evaluate the overall outperformance of eMBI using a CRC dataset. The dataset (Data 2) we consider here contains 54,675 genes and 62 samples belonging to two dominant CRC subtypes. According to the study of Schlicker et al, 24 these two subtypes can be further divided into five subtypes that exhibit activation of specific signaling pathways.
We first detected the two major subtypes using MBI, then our new program eMBI. The comparison results are shown in Table 2, in which the accuracy and running time are both based on average values of 10 different runs. eMBI runs five times faster than MBI (Table 2), and more importantly, eMBI has much higher accuracy. The improvements of eMBI are more significant when we further divide two subtypes into five subtypes (Table 3). To further check the accuracy of each method in each run, we can see that eMBI is very robust, while MBI is not (Fig. 7). Basically, eMBI greatly outperforms MBI in prediction accuracy, robustness, and running time.

eMBI is robust in 10 different runs, while MBI is not.
Comparing eMBI with MBI on Data 2 (#subtype: 2).
Comparing eMBI with MBI on Data 2 (#subtype: 5).
Comparison eMBI with other methods
To compare eMBI with other methods, we consider another matrix factorization method, NMF, and HC method. A non-small-cell lung cancer dataset (Data 3) in recent study 25 is used as our test data. This dataset is composed of two subtypes of samples: 40 AC and 18 SCC samples. The comparison result is shown in Table 4. HC is the fastest method, but it has the worst accuracy. NMF also runs very fast, with a little higher accuracy than HC, while its performance in prediction accuracy is much worse than MBI and eMBI. MBI exhibits a higher accuracy than those of both NMF and HC, but unluckily, it runs too slow. Our eMBI runs about 10 times faster than MBI, and more importantly, it has the highest accuracy, which is the most important measure for the problem of clustering for cancer subtypes.
Comparing different clustering methods on Data 3.
Also note that by simultaneously clustering genes and conditions, eMBI can potentially provide useful information to identify marker genes, which is an important goal in the medicine research field. For example, by checking each gene cluster of eMBI, we can find gene clusters in which genes express differently in different patient groups. One gene cluster containing 92 genes of Data 1 is shown in Figure 8. This gene cluster can significantly classify the four different types of the patient samples and potentially include the candidate marker genes. Although NMF is also a matrix factorization method, it represents the genes with a small number of metagenes, and hence cannot capture marker genes effectively.

A cluster of genes which can classify four different types of cancer patient samples of Data 1. Here, each curve corresponds to the expression of one gene. The
Conclusions
A challenging and important problem in medicine is to identify clinically relevant subtypes of a cancer using gene expression data. In this study, we develop effective strategies to tailor a recently proposed method MBI for this problem, and implement a new open-source program called eMBI (the MAT-LAB source code version is available at: http://bioinformatics.astate.edu/). Test results on several cancer data consistently indicate that eMBI has greater improvement in comparison with MBI, in the sense of cancer subtype prediction accuracy, robustness, and running time. The HC method, like many other traditional clustering methods, works in this situation, but it is not a good choice because of its low accuracy. Clearly, advanced knowledge of gene expression data clusters can help in clustering cancer patients into clinically relevant subtypes. In the future, we will further improve the prediction accuracy of eMBI, and pay more attention to identification of marker genes. We will develop eMBI to automatically detect those interesting gene clusters and identify effective maker genes, which will benefit cancer gene expression studies and future clinical applications.
Author Contributions
XH conceived and designed the study. ZC and CA implemented and tested the eMBI method. ZC, ZW, and CZ compared eMBI with other methods and prepared the figures and tables. ZC wrote the manuscript. GL, SZ, and XH revised the manuscript. All authors reviewed and approved of the final manuscript.
Footnotes
Acknowledgment
We would like to thank Qin Ma and Juntao Liu for their helpful suggestions.
