Abstract
BACKGROUND:
In genome research, it is particularly important to identify molecular biomarkers or signaling pathways related to phenotypes. The logistic regression model is a powerful discrimination method that offers a clear statistical interpretation and yields class membership probabilities. However, on its own it cannot perform biomarker selection.
OBJECTIVE:
The aim of this paper is to give the logistic regression model an efficient gene selection capability.
METHODS:
In this paper, we propose a new penalized logsum network-based regularization logistic regression model for gene selection and cancer classification.
RESULTS:
Experimental results on simulated data sets show that our method is effective in the analysis of high-dimensional data. For a large data set, the proposed method achieved AUCs of 89.66% (training) and 90.02% (testing), which are, on average, 5.17% (training) and 4.49% (testing) better than those of mainstream methods.
CONCLUSIONS:
The proposed method can be considered a promising tool for gene selection and cancer classification of high-dimensional biological data.
Introduction
Microarray technology is one of the most recent advances in cancer research; with it, the expression levels of thousands of genes can be recorded simultaneously. In genomic analysis, the identification of molecular biomarkers or signaling pathways associated with phenotypes is a particularly important issue. Logistic regression is a powerful discrimination method that enables clear statistical interpretation and yields class membership probabilities.
From a biological point of view, only a few genes are related to the target disease, and most genes are not involved in cancer classification. Unrelated genes may cause noise and reduce the accuracy of prediction. In addition, from a machine learning perspective, too many features may cause overfitting and negatively affect classification performance.
Regularization methods have been widely used to deal with high-dimensional problems. A popular regularization method is the least absolute shrinkage and selection operator (Lasso) or
In short, existing methods show good results in terms of feature selection and model construction, but they either cannot produce sufficient sparsity or do not use any network interaction knowledge.
In this paper, we investigate the sparse logistic regression model with a logsum network-based (Logsum-Net) penalty, in particular for gene selection in cancer classification.
The major contributions of this paper are as follows:
A new method of gene selection is put forward. The logsum method is an efficient tool for feature selection; however, it is motivated from a strictly computational viewpoint and has no built-in mechanism for using a priori biological structure information. Unlike previous research, we propose a Logsum-Net regularization that integrates prior biological graph information. We compare the proposed method with several mainstream methods in a simulation experiment and on a breast cancer gene expression data set, and the experimental results show that the method is feasible. For the breast cancer data set, the AUC of our method is, on average, 5.17% (training) and 4.49% (testing) better than that of mainstream methods.
Suppose that dataset D has
where
Equation (2) easily overfits when applied to problems with high dimensionality and a small sample size. Regularization is a commonly used technique for high-dimensional problems, and it can be expressed as:
where
where
There has been much research into network-based penalties. Li and Li [17], Chen et al. [18] and Wang et al. [19], for example, suggested a Lasso network-based method for the study of gene expression. However, the result achieved by the Lasso method is not sparse enough for genome data. In this paper, we propose a logsum network-based (Logsum-Net) method as follows:
where
where
This equation not only ensures sparsity in the solutions, making them more appropriate for biological interpretation, but also smooths the regression coefficients of genes that are connected in the network.
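As a rough illustration of how a penalty of this kind combines the two effects, the sketch below evaluates a sparsity term plus a network smoothness term for a given coefficient vector. The functional form is an assumption made for illustration only and may differ from the paper's equation: the sparsity term is taken as log(1 + |beta_j| / eps), and the smoothness term as the normalized-Laplacian quadratic form on an unweighted graph. The names `logsum_net_penalty`, `lam1`, `lam2` and `eps` are hypothetical.

```python
import math


def logsum_net_penalty(beta, edges, lam1, lam2, eps=0.01):
    """Evaluate an assumed Logsum-Net style penalty for coefficients `beta`.

    `edges` is a list of (u, v) index pairs describing an unweighted gene
    network; `lam1` weights the sparsity term, `lam2` the smoothness term.
    """
    # Sparsity term (assumed form): sum_j log(1 + |beta_j| / eps).
    sparsity = sum(math.log1p(abs(b) / eps) for b in beta)

    # Node degrees of the unweighted network.
    degree = [0] * len(beta)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Smoothness term (assumed form): normalized-Laplacian quadratic form,
    # sum over edges of (beta_u / sqrt(d_u) - beta_v / sqrt(d_v))^2.
    smooth = sum(
        (beta[u] / math.sqrt(degree[u]) - beta[v] / math.sqrt(degree[v])) ** 2
        for u, v in edges
    )
    return lam1 * sparsity + lam2 * smooth
```

Under these assumptions, two connected genes with equal coefficients contribute nothing to the smoothness term, while coefficients of opposite sign on connected genes are penalized.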
We first propose a new threshold-based solver for the Logsum-Net penalty. We then apply an efficient coordinate descent algorithm (CDA) [20] to solve the Logsum-Net logistic regression model (Logsum-NL).
Threshold solver of the Logsum-Net penalty
Assuming a linear model with
For simplicity, the predictors and responses are all standardized. A Logsum-Net linear model can be expressed as:
where
Recall that
The first partial derivative with respect to
where
By setting Eq. (3.1)
where
By the Taylor series method, we can rewrite Eq. (2) as follows:
where
Step 1: Set all
Simulation
Here, we conduct a simulation study to assess the gene selection and classification ability of the proposed method. Several penalized logistic regression methods are compared in the experiment: the Lasso method, the L
Two models are considered in the simulation. Each model has 200 instances: 100 for training and 100 for testing.
In the first model, we assumed that a transcription factor (TF) and its targets acted as either activators or repressors of the outcome variable:
In the second model, a TF could be both an activator and a repressor at the same time, and the remaining settings are the same as in the first model:
The cross-validation (CV) technique has been widely used in parameter tuning. Here, we use 10-fold CV [21, 22] on the training set to identify the optimal tuning parameters. Genes with zero coefficients in the fitted model are considered irrelevant to the outcome variable [23].
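The fold construction behind 10-fold CV can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code; the function names `k_fold_indices` and `cv_splits` and the fixed `seed` are hypothetical choices.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def cv_splits(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists, one pair per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation
```

For each candidate tuning parameter, the model would be fitted on the training indices of every split and scored on the held-out fold, and the parameter with the best average score would be kept.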
To consider the impact of variable correlation on the method more fully, a variable
The simulation process was repeated 500 times, and we use P and TP to report the feature selection ability of this method. P refers to the number of non-zero coefficient genes in the prediction model, and TP refers to the number of true non-zero coefficient genes in the model. The classification accuracy for the test set was also calculated, and Tables 1 and 2 summarize the results of each model.
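The two counts P and TP can be computed from a fitted coefficient vector as in the sketch below. The function name `selection_counts` and the tolerance parameter `tol` are hypothetical; the set of truly relevant genes is assumed to be known, as it is in a simulation.

```python
def selection_counts(beta, true_genes, tol=0.0):
    """Return (P, TP) for a coefficient vector.

    P  = number of genes with a non-zero coefficient in the fitted model.
    TP = number of those genes that are truly relevant.
    """
    selected = {j for j, b in enumerate(beta) if abs(b) > tol}
    p = len(selected)
    tp = len(selected & set(true_genes))
    return p, tp
```

For example, with coefficients `[0.5, 0.0, -0.2, 0.0, 0.1]` and truly relevant genes `{0, 2, 3}`, three genes are selected and two of them are true, so the function returns `(3, 2)`.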
Simulation study – gene selection performance
Simulation study – classification performance
As shown in Table 1, compared to other algorithms, our method is more accurate at identifying real genes. For example, in Model 1, when
These results show that this method is a useful tool for classification and feature selection.
To further evaluate the performance of the proposed method, we compared our approach with five other regularization methods in an analysis of TCGA breast cancer data. This data set describes 20,501 genes in 806 different breast cancer samples. We retained only samples with complete information. After that, 85 triple-negative breast cancer (TNBC) and 460 non-TNBC samples were further divided into two groups: training (
An extensive biological interaction network was obtained from BioGRID, which consists of 15,211 nodes (genes or other entities) and 336,119 interactions. Mapping the downloaded network onto the gene expression data yielded a network L with 11,320 genes and 224,458 edges.
We also added two new methods for performance comparison: SPL-Logsum [11] and HLR [7]. Table 3 shows that the Logsum-NL method achieved a higher prediction AUC than other mainstream regularization methods.
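The mapping step can be sketched as a simple filter that keeps only interactions whose two endpoints both appear in the expression data. This is an illustrative sketch rather than the authors' pipeline: the function name `restrict_network` is hypothetical, and dropping self-loops and duplicate edges is an assumption, not something the text states.

```python
def restrict_network(edges, expression_genes):
    """Keep interactions whose two endpoints are both measured genes.

    `edges` is an iterable of (gene_a, gene_b) name pairs and
    `expression_genes` the gene names present in the expression matrix.
    Self-loops and duplicate (undirected) edges are dropped by assumption.
    """
    genes = set(expression_genes)
    kept = set()
    for a, b in edges:
        if a in genes and b in genes and a != b:
            # Store each undirected edge once, in sorted order.
            kept.add((a, b) if a < b else (b, a))
    nodes = {g for edge in kept for g in edge}
    return sorted(nodes), sorted(kept)
```

The returned node and edge lists are what a network penalty would then be built from, for instance by computing node degrees for a graph Laplacian.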
The results for breast cancer
The top ten ranked genes
Table 4 shows that the genes identified by our Logsum-NL method include SplA/ryanodine receptor domain and SOCS box containing 1 (SPSB1), which was recently found to be spontaneously upregulated during breast tumor recurrence and to be both necessary and sufficient for promoting tumor recurrence [25]. Estrogen receptor 1 (ESR1) is one of the important clinical markers for classifying breast cancer subtypes and can be used to guide both prognosis and treatment decisions [26]. In breast tumors, protocadherin 10 (PCDH10) is down-regulated and hypermethylated [27]. Lymphocyte antigen 6 family member H (LY6H) is a cancer biomarker and therapeutic target that induces invasion and metastasis. LY6H is involved in the development of breast cancer by affecting the Ras/ERK cellular pathway. This gene may be a new marker for diagnosis and gene therapy in breast cancer patients [28].
The logsum method is a powerful method for feature selection. However, it cannot use any prior biological structure information. To overcome this drawback, in this paper we first propose Logsum-Net regularization to integrate biological network knowledge. Then, we propose the penalized Logsum-Net regularization logistic regression model (Logsum-NL) for gene selection and cancer classification. For a large real data set, the proposed method achieved AUCs of 89.66% (training) and 90.02% (testing), which are, on average, 5.17% (training) and 4.49% (testing) better than those of mainstream methods. Therefore, the proposed Logsum-NL method is a promising tool for gene selection and cancer classification of high-dimensional biological data. A limitation of this article is that it does not include an in-depth analysis of the selected genes. Future directions for research include further analysis of the potential clinical application of the selected genes and investigation of the method with other high-dimensional data and models.
Acknowledgments
This research was supported by Macau Science and Technology Development Funds (0158/2019/A3, 0002/2019/APD) of Macau SAR of China, Special Innovation Projects of Universities in Guangdong Province (2018KTSCX205), the Natural Science Foundation of Guangdong Province (2018A030307033) and the National Natural Science Foundation of China (6201101081, 62006155).
Conflict of interest
None to report.
