Machine learning algorithm and deep neural networks identified a novel subtype in hepatocellular carcinoma

Abstract

BACKGROUND:

Hepatocellular carcinoma (HCC) is one of the most common malignant tumors. Due to the lack of specific characteristics in the early stage of the disease, patients are usually diagnosed in the advanced stage of disease progression.

OBJECTIVE:

This study used machine learning algorithms to identify key genes in the progression of hepatocellular carcinoma and constructed a prediction model to predict the survival risk of HCC patients.

METHODS:

Transcriptome data and clinical information were downloaded from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO). Differential expression analysis and the COX proportional-hazards model were used to identify survival-related genes. K-Means, random forest, and LASSO regression were used to identify novel subtypes of HCC and to screen key genes. The prediction model was constructed with deep neural networks (DNN), and Gene Set Enrichment Analysis (GSEA) was used to reveal the pathways in which the key genes are involved.

RESULTS:

Two subtypes with significantly different survival rates were identified ( $p<$ 0.0001, AUC $=$ 0.720), along with 17 key genes associated with the subtypes. The accuracy of the deep neural network prediction model was greater than 93.3%. GSEA found that the survival-related genes were significantly enriched in hallmark gene sets in the MSigDB database.

CONCLUSIONS:

In this study, we used machine learning algorithms to screen out 17 genes related to the survival risk of HCC patients, and trained a DNN model based on them to predict the survival risk of HCC patients. The genes that make up the model are all key genes that affect the formation and development of cancer.

Keywords

Hepatocellular carcinoma, TCGA, machine learning, deep neural network, prediction model, survival

Abbreviation

DEGs, Differentially Expressed Genes; DNN, Deep neural networks; FPKM, Fragments Per Kilobase of exon model per Million mapped fragments; GEO, Gene expression omnibus; GSEA, Gene Set Enrichment Analysis; HCC, Hepatocellular carcinoma; LASSO, Least absolute shrinkage and selection operator; ROC, Receiver operating characteristic; TCGA, The cancer genome atlas.

1. Introduction

Hepatocellular carcinoma (HCC) is the fifth most common cancer worldwide and one of the top three causes of cancer-related death [1]. In North America and Europe, where the incidence of HCC has historically been low, it has increased in recent years [2]. Moreover, the early symptoms of HCC are not obvious, and most patients are already at a late stage when the disease is discovered. Therefore, there is an urgent need for an accurate model that predicts the survival risk of HCC patients to guide clinical treatment [3]. Furthermore, the amount of publicly available high-throughput sequencing and microarray data is increasing worldwide, and finding the characteristics of cancer in such large data sets requires efficient tools. Machine learning and deep learning, which are built on algorithms and statistical models, can efficiently handle large numbers of genes when prognostic models are constructed [4]. In recent years there have been many precedents for using machine learning algorithms to build cancer prognosis prediction models. Yang et al. used the random forest algorithm to predict the risk of colorectal cancer recurrence. Xue et al. screened 12 genes for predicting the survival risk of lung adenocarcinoma patients through machine learning algorithms. Liu et al. used the LASSO algorithm to identify immune-associated prognostic features of melanoma [5, 6, 7].

Changes in gene expression and disorders of metabolic, pathological, and physiological signaling pathways can promote the course and progression of HCC [8, 9]. In addition, complex changes in signaling pathways significantly influence the migration and invasion of HCC cells. To metastasize, cancer cells up-regulate the expression of various metastasis-promoting genes. An in-depth understanding of these molecular mechanisms may help develop effective metastasis-targeted therapies and improve the overall prognosis of HCC patients. In recent years, many researchers have established HCC prognostic prediction models and explored, at the level of molecular mechanism, how the genes in these models influence the formation and development of cancer [10, 11, 12].

This study aims to use machine learning algorithms to screen prognostic genes of hepatocellular carcinoma (HCC), identify two novel subtypes through an unsupervised clustering method, and construct a prediction model with significant prognostic value using a deep neural network. We integrated differentially expressed genes from TCGA transcriptome data and a GEO data set to identify overlapping differentially expressed genes. Univariate and multivariate COX regression were used to identify survival-related genes. K-Means clustering then classified the HCC patient samples based on the expression of these survival-related genes, dividing the patients into two subtypes with high and low survival rates. Random forest and LASSO regression were used to find the most important genes that distinguish the two subtypes, and 17 overlapping genes were obtained. The differentially expressed genes between the two subtypes were annotated with the KEGG, GO, and MSigDB databases. The dysregulated expression of these 17 genes in HCC patients provides a key basis for identifying the two new subtypes and their survival rates. Finally, a DNN model was trained on the 17 gene features to build a prognostic prediction model that can accurately predict the survival risk of HCC patients.

2. Material and methods

2.1 Data source and processing

In this study, we selected the gene expression microarray data set GSE112790 from the GEO database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE112790). We also downloaded the transcriptome data and clinical information of the liver cancer samples (TCGA-LIHC) from the TCGA database. The Series Matrix File(s) of the GSE112790 data set indicate that the microarray data had already been normalized with the RMA algorithm, which simplifies the subsequent differential analysis. The transcriptome data in the TCGA database were obtained by high-throughput sequencing, and we chose the FPKM-normalized expression data, which we then converted to TPM. Differential expression analysis was performed on the two data sets using the R package “limma” [13], with the thresholds set to $|\text{log2(FC)}|>$ 1 and FDR $<$ 0.05.

FPKM calculation formula [14]:

$\displaystyle\text{FPKM}=\frac{\textit{RM}_{g}*10^{9}}{\textit{RM}_{t}*L}$

$\textit{RM}_{g}$ : Number of reads mapped to the gene.

$\textit{RM}_{t}$ : Number of reads mapped to all protein-coding genes.

$L$ : Length of the gene in base pairs; Calculated as the sum of all exons in a gene.
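As an illustration of the formula above and of the subsequent TPM transformation, the following minimal Python sketch computes FPKM for a toy sample and rescales it to TPM. The read counts and gene lengths are hypothetical, and the conversion assumes the commonly used relation in which each FPKM value is divided by the sum of FPKM values in the sample and multiplied by $10^{6}$; the source does not state which implementation was actually used.

```python
def fpkm(reads_gene, reads_total, length_bp):
    """FPKM = RM_g * 10^9 / (RM_t * L), following the formula in the text."""
    return reads_gene * 1e9 / (reads_total * length_bp)


def fpkm_to_tpm(fpkm_values):
    """Rescale the FPKM values of one sample so that they sum to one million."""
    total = sum(fpkm_values)
    return [value * 1e6 / total for value in fpkm_values]


# Hypothetical toy sample: (reads mapped to the gene, gene length in bp).
genes = [(500, 1000), (1500, 3000), (2000, 2000)]
reads_total = sum(reads for reads, _ in genes)

fpkm_values = [fpkm(reads, reads_total, length) for reads, length in genes]
tpm_values = fpkm_to_tpm(fpkm_values)
```

Because TPM values sum to the same total in every sample, they are easier to compare across samples than FPKM values.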

2.2 COX regression model

The COX regression model is mainly used for the prognostic analysis of tumors and other chronic diseases, and its results can be applied directly in clinical settings [15]. It therefore plays a critical role in the clinical assessment of cancer. This study uses the COX regression model to evaluate simultaneously the impact of multiple prognostic factors on the survival risk of patients. All steps were implemented with the “survminer” and “survival” packages in R. The COX model calculates the independent hazard ratio (HR) of each gene and the corresponding confidence interval, identifies significant prognostic genes through the $p$ value, and obtains the global statistical significance of the model from the likelihood ratio test. The COX model focuses on the relationship between the covariates $x_{i}$ and the survival hazard function $h$ , which is given by the following formula:

$\displaystyle h(t,X)=h_{0}(t)\cdot e^{(\beta_{1}x_{1}+\beta_{2}x_{2}+\ldots+\beta_{m}x_{m})}$

$h(t,X)$ : the hazard function at time $t$ for a patient with covariates $X=(x_{1},\ldots,x_{m})$ .

$h_{0}(t)$ : the baseline hazard function at time $t$ .

$\beta_{1}$ , $\beta_{2}$ , …, $\beta_{m}$ : the partial regression coefficients of the independent variables.
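The role of the coefficients can be illustrated with a short Python sketch. Because the baseline hazard $h_{0}(t)$ cancels when two patients are compared, the ratio of their hazards depends only on the exponential term. The coefficients and covariate values below are hypothetical and are not taken from the fitted model.

```python
import math


def relative_hazard(betas, x):
    """h(t, X) / h0(t) = exp(beta_1 * x_1 + ... + beta_m * x_m)."""
    return math.exp(sum(b * xi for b, xi in zip(betas, x)))


def hazard_ratio(beta):
    """Hazard ratio for a one-unit increase in a single covariate."""
    return math.exp(beta)


betas = [0.7, -0.4]        # hypothetical partial regression coefficients
patient_a = [1.0, 0.0]     # hypothetical covariate values
patient_b = [0.0, 1.0]

# The baseline hazard cancels, so this ratio does not depend on t.
ratio = relative_hazard(betas, patient_a) / relative_hazard(betas, patient_b)
```

A positive coefficient gives a hazard ratio above 1 (higher risk), and a negative coefficient gives a hazard ratio below 1.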

2.3 K-Means clustering algorithm

K-Means is an unsupervised clustering algorithm that divides the samples into $K$ clusters according to the Euclidean distance between samples [16]. The algorithm aims to keep the samples within each cluster as close together as possible and the clusters as far apart as possible. Assuming that the samples are divided into $K$ clusters ( $C_{1}$ , $C_{2}$ , …, $C_{K}$ ), the goal of the K-Means algorithm is to minimize the squared error $E$ , which is calculated by the following formula:

$\displaystyle E=\sum_{i=1}^{K}\sum_{x\in C_{i}}\left\|x-u_{i}\right\|_{2}^{2}$

Where $u_{i}$ is the mean vector of the corresponding cluster $C_{i}$ , or the centroid, and the mathematical expression is as follows:

$\displaystyle u_{i}=\frac{1}{|C_{i}|}\sum_{x\in C_{i}}x$

We calculated the Euclidean distance between HCC patient samples from TCGA for K-Means clustering and compared the silhouette widths obtained with different numbers of clusters to determine the best $K$ value; all steps were implemented in R. The clustering was best when the samples were divided into two clusters ( $K=$ 2), and survival analysis was subsequently performed between the two clusters. According to the Kaplan-Meier (K-M) curves, the survival rates of the two clusters differed significantly, so the HCC patients were divided into high-risk and low-risk subtypes.
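The procedure can be sketched in Python as follows. This is a minimal, standard-library illustration of K-Means with selection of $K$ by the average silhouette width; it is not the R code used in the study, and the two-dimensional points are hypothetical stand-ins for the gene expression profiles.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def centroid(points):
    """Mean vector u_i of a cluster."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, iters=100, seed=0):
    """Plain K-Means: assign each point to its nearest centroid, then update centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(centroid(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def silhouette(points, labels):
    """Average silhouette width; singleton clusters contribute 0."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                   # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())     # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups of samples.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
scores = {k: silhouette(points, kmeans(points, k)[0]) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
```

The value of $K$ with the largest average silhouette width is taken as the best number of clusters.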

2.4 Random forest classification

Random forest is a machine learning algorithm that performs well in both classification and regression. Because of its processing speed and its ability to handle large data sets, it has been widely used in bioinformatics [17]. In addition, random forest can measure the importance of features, with the Gini index as the corresponding evaluation index: the larger the GINI index of a feature, the more important that feature is to the category vector. We used the R package “randomForest” to implement the random forest algorithm. The high-risk and low-risk subtypes obtained through K-Means clustering were used as the known category vector, and the gene expression values as the feature vector. After repeated training, we obtained a random forest model with a very low out-of-bag (OOB) error rate. The random forest algorithm also yields the GINI index of each feature (that is, of each gene), which we sorted from high to low. The GINI index is calculated by the following formula:

$\displaystyle\mbox{GINI}=1-\sum_{k=1}^{K}P_{k}^{2}$

$P_{k}$ represents the probability that a selected sample belongs to category $k$ , and the corresponding probability of the sample being misclassified is $(1-P_{k})$ .
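The GINI formula can be illustrated with a few lines of Python. The sketch below computes the index for a set of class labels and the decrease in the index produced by one split. It assumes, as is common for random forests, that feature importance is aggregated from such decreases over all trees; the labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """GINI = 1 - sum_k P_k^2 for the class proportions P_k in `labels`."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Reduction in GINI obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Hypothetical node holding two low-risk (C1) and two high-risk (C2) samples.
parent = ["C1", "C1", "C2", "C2"]
perfect_split = gini_decrease(parent, ["C1", "C1"], ["C2", "C2"])
useless_split = gini_decrease(parent, ["C1", "C2"], ["C1", "C2"])
```

A gene whose expression threshold separates the two subtypes cleanly produces a large decrease, whereas an uninformative gene produces a decrease close to zero.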

2.5 LASSO algorithm

One drawback of the random forest algorithm is that it is prone to overfitting. LASSO is a regularization method that adds an $L_{1}$ penalty to the least squares estimate, so that the estimates of some coefficients are shrunk to exactly zero.

$\displaystyle\hat{\beta}_{\textit{Lasso}}=\arg\min_{\beta\in R^{d}}\left(\|Y-X\beta\|^{2}+\lambda\sum_{j=1}^{d}|\beta_{j}|\right)$

Here, $Y$ is the response variable, $X$ is the covariate matrix, $\beta$ is the vector of regression coefficients, $\hat{\beta}_{\textit{Lasso}}$ is the LASSO estimate, $d$ is the number of parameters (that is, the number of features), and $\lambda$ is the regularization parameter.

Figure 1.

The overall design of this study.

LASSO includes a screening process for variables. Instead of fitting all variables at the beginning, LASSO selectively adds variables to optimize the model and its performance [18]. The parameter $\lambda$ controls the complexity of the LASSO model: increasing $\lambda$ increases the penalty on the multivariate linear model and thereby screens out variables. In addition, another parameter, $\alpha$ , controls the behavior of the model when dealing with highly correlated data (in “glmnet”, $\alpha=$ 1 corresponds to the LASSO penalty). We used the R package “glmnet” to implement the LASSO algorithm. The optimal solution of the model yields a subset of feature variables, that is, the most important genes.
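How the $L_{1}$ penalty sets coefficients to zero can be seen in a minimal coordinate-descent sketch of the least-squares objective $\|Y-X\beta\|^{2}+\lambda\sum_{j}|\beta_{j}|$ . This is only an illustration: it is not the “glmnet” implementation, it uses the linear rather than the logistic form of the model, and the data are hypothetical.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_cd(X, y, lam, iters=200):
    """Coordinate descent for ||y - X beta||^2 + lam * sum_j |beta_j|."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(iters):
        for j in range(d):
            rho = 0.0   # correlation of feature j with the partial residual
            z = 0.0     # squared norm of feature j
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            # For this objective the one-dimensional minimizer is S(rho, lam / 2) / z.
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta


# Hypothetical data: y depends only on the first feature.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [2.0, 4.0, 6.0, 8.0]

beta_unpenalized = lasso_cd(X, y, lam=0.0)
beta_penalized = lasso_cd(X, y, lam=4.0)
```

With the penalty switched on, the coefficient of the informative feature is shrunk slightly and that of the uninformative feature stays at exactly zero, which is the screening behavior described above.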

2.6 Deep neural network model

A deep neural network (DNN), also called a multi-layer perceptron (MLP), consists of an input layer, one or more hidden layers, and an output layer. In this study, we built the DNN model in Python using Keras with TensorFlow as the back end. To verify the effectiveness of the DNN model, the HCC patient samples were randomly divided in a 7:3 ratio. We used 70% of the samples to train the prognosis prediction model, and the other 30% were used for prediction. Training was monitored through the accuracy of the model output and the cross-entropy loss, and the prediction results were compared with the K-Means results in a confusion matrix.
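The two monitoring quantities and the 7:3 split can be written down independently of the network itself. The sketch below uses only the Python standard library; it does not reproduce the Keras model, whose architecture is not specified here, and the labels and predicted probabilities are hypothetical.

```python
import math
import random


def split_70_30(samples, seed=0):
    """Shuffle the samples and split them 7:3 into training and prediction sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = (7 * len(shuffled)) // 10
    return shuffled[:cut], shuffled[cut:]


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


train, held_out = split_70_30(range(365))

# Hypothetical labels (1 = high risk) and predicted probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.1]
loss = binary_cross_entropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
```

During training, a falling cross-entropy together with a rising accuracy on both the training and the validation data indicates that the model is learning without obvious overfitting.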

2.7 Gene set enrichment analysis (GSEA)

Gene Set Enrichment Analysis (GSEA) is a computational method that calculates an enrichment score, estimates its significance level, and adjusts for multiple hypothesis testing [19]. The advantage of GSEA is that it can take into account genes with small expression differences but important functions. We used the R package “clusterProfiler” to perform GSEA, with all gene set collections in the MSigDB database (C1, C2, C3, C4, C5, C6, C7, C8, H) as the background gene sets. We also used Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) to annotate the differential genes. The GO database has three main categories of annotation: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). These three categories describe the function of a gene from several perspectives [20]. The KEGG database annotates the metabolic pathways in which genes participate [21].
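The idea behind the enrichment score can be sketched as a running sum over a ranked gene list. The Python example below implements the simplest, unweighted variant, in which every hit contributes equally; standard GSEA usually weights hits by the ranking metric, and “clusterProfiler” additionally estimates significance by permutation. The gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of `gene_set` in `ranked_genes`."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up at genes in the set, step down at all other genes.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranking from most up-regulated to most down-regulated.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
es_top = enrichment_score(ranked, {"g1", "g2"})
es_bottom = enrichment_score(ranked, {"g7", "g8"})
```

A gene set concentrated at the top of the ranking gives a score near 1, and one concentrated at the bottom gives a score near -1.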

2.8 Statistical analysis

Figure 2.

Identifying the differentially expressed genes (DEGs) from two data sets. (A) Volcano plot presenting the differentially expressed genes (DEGs) between the tumor patient and normal group in GSE112790. (B) Volcano plot presenting the DEGs between the tumor patient and normal group in TCGA-LIHC. (C) Heatmap of differentially expressed genes (DEGs) expression. (D) Venn diagram presenting the number of the overlapped DEGs.

R software version 4.0.5 was used for the statistical analysis in this article. A $p$ value $<$ 0.05 was considered statistically significant unless otherwise mentioned. Kaplan-Meier curve analyses and log-rank tests were performed with the package “survminer”. Ten-fold cross-validation was performed to obtain the optimal lambda values (the minimum lambda value) in the LASSO models. Finally, we used the package “pROC” to perform receiver operating characteristic (ROC) curve analysis.
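For readers unfamiliar with the Kaplan-Meier method, the estimator can be written in a few lines of Python. This is a minimal sketch of the product-limit formula and not the “survminer” implementation; the follow-up times and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each time with at least one event (1 = death, 0 = censored)."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed   # deaths and censored patients both leave the risk set
    return curve


# Hypothetical follow-up times (months) and event indicators.
times = [1, 2, 3, 4, 5]
events = [1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Censored patients lower the number at risk without lowering the survival estimate, which is why the curve drops only at observed events.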

3. Results

3.1 Identification of the DEGs

Figure 1 shows the steps and logical sequence of this study. We downloaded 198 samples (183 patient samples and 15 normal samples) from GSE112790. Using the package “limma” to identify the differentially expressed genes (DEGs), we obtained 624 down-regulated genes and 794 up-regulated genes. The corresponding volcano plot is shown in Fig. 2A. We also downloaded 421 samples (371 patient samples and 50 normal samples) from TCGA and identified 464 down-regulated genes and 2394 up-regulated genes in the differential expression analysis. The corresponding volcano plot is shown in Fig. 2B. The heatmap shows all the DEGs (Fig. 2C). We then selected the DEGs shared by the two differential expression analyses (Fig. 2D).
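The selection of overlapping DEGs can be illustrated with a short Python sketch that applies the thresholds $|\text{log2(FC)}|>$ 1 and FDR $<$ 0.05 to two result tables and intersects them. The gene names and values are hypothetical, and the requirement that a gene change in the same direction in both data sets is an assumption made for this illustration.

```python
def select_degs(results, lfc_cut=1.0, fdr_cut=0.05):
    """Split genes into up- and down-regulated sets using |log2(FC)| > 1 and FDR < 0.05."""
    up, down = set(), set()
    for gene, (log2fc, fdr) in results.items():
        if fdr < fdr_cut and log2fc > lfc_cut:
            up.add(gene)
        elif fdr < fdr_cut and log2fc < -lfc_cut:
            down.add(gene)
    return up, down


# Hypothetical result tables: gene -> (log2 fold change, FDR).
geo = {"geneA": (2.1, 0.001), "geneB": (-1.5, 0.01),
       "geneC": (0.4, 0.001), "geneD": (1.8, 0.20)}
tcga = {"geneA": (1.3, 0.002), "geneB": (-2.2, 0.03),
        "geneC": (1.6, 0.001), "geneD": (1.9, 0.01)}

geo_up, geo_down = select_degs(geo)
tcga_up, tcga_down = select_degs(tcga)

# Assumption: a gene overlaps only if it changes in the same direction in both sets.
overlap = (geo_up & tcga_up) | (geo_down & tcga_down)
```

In this toy example, geneC fails the fold-change threshold and geneD fails the FDR threshold in one of the two tables, so neither is retained.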

3.2 Selection of key HCC survival-related genes

Figure 3.

Two novel subtypes based on unsupervised K-Means clustering. (A) The curve shows that when the number of clusters is 2, our data can have the best clustering effect. (B) A distribution diagram of two types of samples in a two-dimensional coordinate. (C) Kaplan-Meier survival analysis of the two novel subtypes based on K-Means clustering. The Red line indicates the low-risk group; the blue line indicates the high-risk group. (D) ROC curves show that the result of K-Means clustering is reliable. (E) The proportion of different cancer stages in the two groups (G1, G2, G3, G4). (F) The proportion of different cancer stages in the two groups (Stage I, Stage II, Stage III, Stage IV).

We used a univariate COX regression model in R to examine the prognostic effect of each gene. The screening threshold was $p<$ 0.05, indicating a statistically significant prognostic effect. This yielded 546 differentially expressed genes that met the condition. Building on the univariate COX results, we further considered clinical factors, including gender, age, T stage, Stage, and Grade. We then calculated the regression coefficient, hazard ratio, corresponding confidence interval, and $p$ value of each gene through multivariate COX regression. In total, we selected 391 significant genes that are strongly related to prognosis as the survival-related genes for the subsequent analysis.

3.3 Two novel subtypes based on unsupervised K-Means clustering

Figure 4.

Recognition of candidate genes in RandomForest. (A) 1000 decision trees are enough to make the error of the RandomForest model converge to a reliable interval. (B) MDS plot presenting sample distribution in C1 or C2. (C) ROC curves show that the result of RandomForest is reliable. (D) The top 50 were selected because of their Gini coefficient.

For the unsupervised K-Means clustering, the PAM function was used to calculate the silhouette width under different $K$ values, and the corresponding clustering effect is shown in Fig. 3A. The K-Means model had the optimal clustering effect when $K=$ 2. The corresponding spatial distribution is shown in Fig. 3B. The 365 patients were clustered into two novel subtypes: 243 low-risk patients (C1) and 122 high-risk patients (C2). We used the Kaplan-Meier method to analyze the survival rates of the two subtypes, and the K-M curve is shown in Fig. 3C ( $p<$ 0.0001). The K-Means clustering model thus distinguishes patients with different risks and has reference value for patient prognosis; the corresponding ROC curve is shown in Fig. 3D. To further characterize the distribution of patients between the high-risk and low-risk subtypes, we compared the proportion of patients with different AJCC grades between the two subtypes. The proportion of patients with advanced cancer is much higher in the high-risk subtype than in the low-risk subtype (Fig. 3E–F).

3.4 Random forest and LASSO dimensionality reduction

Figure 5.

Recognition of candidate genes in LASSO. (A) 47 genes were selected on LASSO coefficient profiles. The colored curves correspond to these genes. (B) The process in which the coefficient gradually tends to 0 as the lambda increases. (C) The change of binomial deviance in the LASSO model as lambda increases tells us why 47 genes were selected. (D) The change of AUC value in the LASSO model.

The number of decision trees in the random forest model (ntree) was optimized using the error-versus-trees plot (Fig. 4A), and “mtry” was left at its default value. The feature vectors of the model are the survival-related genes, and the category vector consists of the high-risk and low-risk subtypes. After repeated training, the out-of-bag error rate of the model was 4.66%. We converted the proximity matrix into a distance matrix and calculated the importance of each MDS axis to draw an MDS plot (Fig. 4B). The high-risk and low-risk patient samples are distributed at opposite ends of the MDS1 axis, and the corresponding ROC curve (Fig. 4C) also supports the reliability of the random forest model. We sorted the genes by their GINI coefficients from high to low and selected the top 50 most important genes (Fig. 4D).

We used the R package “glmnet” to implement LASSO logistic regression, with the family parameter set to “binomial”. As the penalty (lambda) increases, the coefficients of the variables (genes) in the model shrink toward 0; the coefficient profiles are shown in Fig. 5A and B. Cross-validation was used to optimize the model and to ensure that its mean square error meets the requirements (Fig. 5C). The area under the ROC curve (AUC) of each model was also recorded (Fig. 5D), and all AUC values were greater than 0.9.

3.5 Establishment of DNN prediction model

Figure 6.

Establishment of DNN prediction model. (A) 17 final genes were selected by Random Forest and LASSO. (B) The confusion matrix shows the comparison between the results predicted by the DNN model and the original classification results of K-Means. (C) The accuracy of DNN model training data. (D) The accuracy of DNN model validation data. (E) The cross-entropy of DNN model training data. (F) The cross-entropy of DNN model validation data. (G) The Kaplan-Meier survival analysis shows the effect of the DNN model on the prediction data set classification.

The optimal solution of the LASSO model lies within one standard error of the minimum mean square error, where $\lambda=$ lambda.1se. The model contains a total of 47 variables with non-zero coefficients. Comparing these with the top 50 genes screened by random forest yielded 17 common genes (ALDH2, GYS2, SLC10A1, SLC27A5, SLC22A1, PKM, NCAPD2, HJURP, CKS2, MELK, CEP55, KIF18B, SAE1, CDCA3, AURKB, NUP62, MTFR2) (Fig. 6A). These 17 genes were used as feature vectors for DNN model training, and the category vector was the two subtypes obtained by unsupervised clustering. The TCGA-LIHC samples with complete clinical follow-up information were randomly divided into a prediction set (110 samples, about 30%) and a training set (the remaining 255 samples, about 70%). The DNN model assigns weight coefficients to each gene through continuous learning; the final model accuracy is close to 1 and the cross-entropy loss is close to 0 (Fig. 6C–F). The held-out prediction set was then used to test the DNN model, which helps ensure that the prediction model is reasonable. Putting the DNN predictions and the K-Means clustering results into a confusion matrix (Fig. 6B) gave prediction error rates of 2.50% and 6.67% for the two subtypes, respectively. Survival analysis was performed on the predicted results and the K-M curve was drawn (Fig. 6G). The model can accurately predict the subtype of a patient, and the survival risk of the patient can be judged according to the survival rate of the subtype ( $p=$ 0.00027).
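The per-subtype error rates are read from the confusion matrix row by row. The following Python sketch shows the calculation on hypothetical labels; the counts are illustrative only and are not the values in Fig. 6B.

```python
def confusion_matrix(y_true, y_pred, classes=("C1", "C2")):
    """Nested dict m[true][predicted] holding the sample counts."""
    matrix = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_error(matrix):
    """Fraction of each true class that was assigned to the wrong class."""
    return {t: 1.0 - row[t] / sum(row.values()) for t, row in matrix.items()}


# Hypothetical reference labels (from clustering) and model predictions.
y_true = ["C1", "C1", "C1", "C1", "C2", "C2"]
y_pred = ["C1", "C1", "C1", "C2", "C2", "C2"]

matrix = confusion_matrix(y_true, y_pred)
errors = per_class_error(matrix)
```

Reporting the error separately for each subtype matters when the classes are unbalanced, because a single overall accuracy can hide a high error rate in the smaller class.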

3.6 The results of gene set enrichment analysis

Figure 7.

KEGG pathway enrichment analysis of the DEGs between the C1 and C2. (A) KEGG annotation of the DEGs. The color of the dot indicates statistical significance. The size of the dot indicates the number of genes included. (B) The main metabolic pathways the DEGs are enriched in. Red bars represent pathways that are significantly enriched for up-regulated DEGs. Blue bars represent pathways that are significantly enriched for down-regulated DEGs.

Figure 8.

Gene Ontology (GO) functional enrichment analysis of the DEGs. (A) Up-regulated gene enrichment pathway in BP. (B) Down-regulated gene enrichment pathway in BP. (C) Up-regulated gene enrichment pathway in CC. (D) Down-regulated gene enrichment pathway in CC. (E) Up-regulated gene enrichment pathway in MF. (F) Down-regulated gene enrichment pathway in MF.

To explain the large difference in survival between the two subtypes, we set aside the previous differential expression analysis based on the normal and tumor groups and instead used C1 (low risk) and C2 (high risk) as the new comparison groups. As a result, we obtained 1144 differential genes between the two novel subtypes. We then used the functions “enrichGO” and “enrichKEGG” from the R package “clusterProfiler” to perform hypergeometric tests on the differential genes. The annotation results from the KEGG database are shown in Fig. 7. The main enriched pathways for the differential genes are “Cell cycle” and “Complement and coagulation cascades” (Fig. 7A). Figure 7B shows the main metabolic pathways enriched for the up-regulated and down-regulated differentially expressed genes between the two subtypes, respectively. The bubble plots display the results of the GO annotation (Fig. 8). The up-regulated differentially expressed genes are mainly involved in “small molecule catabolic process (BP)”, “blood microparticle (CC)”, “iron ion binding (MF)”, and other activities (Fig. 8A, C and E). The down-regulated differentially expressed genes are mainly involved in activities such as “organelle fission (BP)”, “chromosomal region (CC)”, and “tubulin binding (MF)” (Fig. 8B, D and F). The bubble size indicates the size of the gene set, and the redder the color, the smaller the $P$ value and the more significant the result.
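The hypergeometric test behind this kind of over-representation analysis can be written directly from its definition. The sketch below is a plain Python illustration rather than the “clusterProfiler” code, and the gene counts are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N background genes, K of which are in the pathway."""
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Hypothetical numbers: 10 background genes, 3 in the pathway,
# 4 differential genes of which all 3 pathway genes are present.
p_value = hypergeom_pvalue(k=3, n=4, K=3, N=10)
```

A small p value means that the pathway contains more differential genes than would be expected by chance; in practice the p values of all tested pathways are then adjusted for multiple testing.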

In the GSEA, we chose the MSigDB database as the background gene set. The GSEA results indicated that the up-regulated genes were significantly enriched in hallmark gene sets in the MSigDB database (Fig. 9A). The top four significantly enriched pathways are “HALLMARK_BILE_ACID_METABOLISM”, “HALLMARK_COAGULATION”, “HALLMARK_FATTY_ACID_METABOLISM”, and “HALLMARK_PEROXISOME”. Figure 9B shows that down-regulated genes are significantly enriched in the regulatory target gene sets of the MSigDB database. The top four significantly enriched gene sets are “TCANNTGAY_SREBP1_01”, “ZNF140_TARGET_GENES”, “ZNF329_TARGET_GENES”, and “ZNF563_TARGET_GENES”.

Figure 9.

GSEA results in MSigDB and GSEA results in KEGG. (A) GSEA results in MSigDB hallmark gene sets. (B) Several significant enrichment pathways in KEGG.

4. Discussion

HCC is still a significant challenge facing global public health, and many studies are devoted to clarifying the pathogenesis and epidemiology of HCC. Although surgery and drug treatment have made significant progress, the prognosis of HCC is still poor. Considering the huge heterogeneity of HCC, there is an urgent need to establish more accurate prognostic prediction models. Since its establishment, the TCGA database has provided great help for cancer researchers to improve the prevention, diagnosis, and treatment of cancer. The type of TCGA data we downloaded is RNA sequencing. We also selected a data set (GSE112790) in the GEO database to screen for the same differential genes. This step makes our subsequent research more generally applicable.

In this paper, the first step was to identify differentially expressed genes from the two projects TCGA-LIHC and GSE112790. Because the relationship between genes and patient survival is our main concern, we used univariate and multivariate COX regression analyses to further identify the survival-related genes. The unsupervised clustering algorithm K-Means has the advantages of strong interpretability and fast convergence. We used the survival-related genes as input to K-Means to cluster the hepatocellular carcinoma patients into two subtypes, one of which has a significantly better survival rate than the other. To check that this result from the machine learning algorithm was not accidental, we used random forest and LASSO regression separately to identify the key genes that affect the prognosis of patients with hepatocellular carcinoma. Finally, we trained the DNN on one part of the samples and used the remaining samples, which the DNN had not seen, to predict the subtype of patients. We used the Kaplan-Meier method to evaluate the effect.

We have obtained an HCC prognosis prediction model based on 17 genes, which predicts the survival risk of HCC patients with good performance. The 17 genes are ALDH2, GYS2, SLC10A1, SLC27A5, SLC22A1, PKM, NCAPD2, HJURP, CKS2, MELK, CEP55, KIF18B, SAE1, CDCA3, AURKB, NUP62, and MTFR2. Five genes (ALDH2, GYS2, SLC10A1, SLC27A5, and SLC22A1) were down-regulated in the subtype with a poor prognosis. ALDH2 is a potential therapeutic target for cancer treatment, and related studies have also identified it as a possible prognostic marker for several cancers [22]. In this study, the expression of ALDH2 was down-regulated in the subtype with a poor prognosis, which is consistent with the study of Zahid et al. [23], who found that transcriptional inhibition of ALDH2 expression can reduce the survival rate of HCC patients. Other researchers have found that high expression of ALDH2 inhibits the expression of the DNA repair protein XRCC1, resulting in a low survival rate for HCC patients [24].

Furthermore, according to a recent report, ALDH2 promotes the transfer of extracellular vesicles enriched with oxidized mtDNA from weakened liver cells into HCC cells, which accelerates the progression of HCC [25]. These findings suggest that ALDH2 could be a suitable biological marker and target for evaluating prognosis and improving the therapy options for HCC. According to one study, glycogen synthase 2 (GYS2) inhibited tumor growth in hepatitis B virus-related HCC via a negative feedback loop with p53. GYS2 expression was considerably down-regulated in HCC, which was linked to lower glycogen content and poor patient outcomes [26]. GYS2 has also been reported to appear in models that predict the survival period of hepatocellular carcinoma [27]. SLC10A1 (also called NTCP) has also been reported as a potential early diagnostic and prognostic marker of HCC. Some studies have shown that in HCC patients, the down-regulation of SLC10A1 is significantly related to recurrence-free survival (RFS), S-Me, and S-Pf [28, 29]. In vitro experiments have shown that the down-regulation of SLC27A5 promotes the progression of HCC by driving EMT. The expression of SLC27A5 is positively correlated with the prognosis of HCC, and SLC27A5 expression is down-regulated in tumor tissues compared to non-tumor tissues. This is also consistent with the behavior of SLC27A5 in patients with a poor prognosis in our study [30]. Down-regulation of SLC22A1, which codes for the organic cation transporter-1 (OCT1), alters the responsiveness of hepatocellular carcinoma (HCC) to the cationic medication sorafenib [31].

Twelve genes (PKM, NCAPD2, HJURP, CKS2, MELK, CEP55, KIF18B, SAE1, CDCA3, AURKB, NUP62, and MTFR2) were up-regulated in the subtype with a poor prognosis. Hypoxia-inducible factor 1 $\alpha$ (HIF-1 $\alpha$ ) can activate PKM transcriptionally, and HCC patients with HIF-1 $\alpha$ -regulated genes had significantly lower overall survival [32]. NCAPD2 can efficiently distinguish tumor tissues from non-tumor tissues [33]. Studies have shown that HJURP induces HCC cell proliferation by ubiquitinating and localizing p21 in the cytoplasm via the MAPK/ERK1/2 and AKT/GSK3 $\beta$ pathways, and high levels of HJURP expression are related to a poor prognosis in HCC patients [34]. CKS2 overexpression has been related to hepatocellular carcinoma cell proliferation through the inhibition of PTEN [35]. MELK may have a role in hepatocarcinogenesis via interactions with the FOXM1/ $\beta$ -catenin pathway and may be a novel pathogenic driver of HCC [36]. Another study shows that MELK overexpression in multiple cells and tissues has been linked to the progression and growth of various malignancies [37]. CEP55 increases HCC cell motility and invasion by modulating JAK2–STAT3–MMP signaling, and increased CEP55 expression is associated with HCC development and poor prognosis in HCC patients [38]. KIF18B may behave as an oncogene in HCC by boosting the activity of the Wnt/ $\beta$ -catenin pathway [39]. SAE1 overexpression was found to be substantially linked to metastasis and disease progression, and up-regulated SAE1 has an oncogenic impact linked to dysregulated cancer metabolic signaling [40]. CDCA3 has been reported as a novel prognostic biomarker in hepatocellular carcinoma associated with immune infiltration [41]. Aurora kinase B (AURKB) predicts aggressive recurrence of hepatocellular carcinoma [42]. Nucleoporin 62 (NUP62) is expressed at high levels in stratified squamous epithelia and at even higher levels in SCCs, revealing its role in cell fate control via modulation of $\Delta$ Np63 $\alpha$ nuclear transport in SCCs [43]. MTFR2 was found to be important in stimulating mitochondrial division in cells, and abnormal mitochondrial division has been linked to malignancies in pathological circumstances [44, 45].

In addition, after classifying the hepatocellular carcinoma patients, we conducted a differential expression analysis between the two subtypes and annotated the resulting DEGs using the GO and KEGG databases. Genes up-regulated in patients with subtype C1 (low-risk) are significantly enriched in pathways such as “Complement and coagulation cascades”, “Metabolism of xenobiotics by cytochrome P450”, and “Drug metabolism-cytochrome P450”. Cytochromes P450 are engaged in a wide range of metabolic and biochemical processes. Conversely, genes down-regulated in patients with subtype C1 (low-risk) are significantly enriched in pathways such as “Cell cycle”, “DNA replication”, and “MicroRNAs in cancer”. Down-regulation of specific miRNAs can lead to increased oncogene expression, which can have severe implications for cell proliferation, differentiation, apoptosis, and tumor growth and progression.
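The enrichment step described above can be illustrated with a small over-representation test. The sketch below is not the code used in this study: it assumes a one-sided hypergeometric test, which is a common way to score KEGG or GO terms, and the function name and all gene counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_pathway, n_degs, n_overlap):
    """One-sided over-representation p-value, P(X >= n_overlap).

    X is the number of pathway genes found among n_degs genes drawn
    without replacement from a universe of n_universe genes, of which
    n_pathway are annotated to the pathway.
    """
    max_overlap = min(n_pathway, n_degs)
    total = comb(n_universe, n_degs)
    # Sum the hypergeometric probability mass over the upper tail.
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_degs - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only: 800 DEGs out of 20,000
# genes, and a pathway with 120 annotated genes, 25 of which are DEGs.
p_value = hypergeom_enrichment_p(
    n_universe=20000, n_pathway=120, n_degs=800, n_overlap=25
)
print(f"enrichment p-value: {p_value:.3g}")
```

In practice, one p-value is computed per pathway or GO term, and the resulting p-values are adjusted for multiple testing (for example with the Benjamini-Hochberg procedure) before a term is called significantly enriched.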

Most of the genes retained by our screening pipeline have an important influence on tumor development. Several of them, such as ALDH2 and SAE1, have even been reported as independent prognostic markers for hepatocellular carcinoma. In other words, the combination of machine learning algorithms used here yielded a prediction model built largely from genes already known to be closely related to the formation and development of hepatocellular carcinoma. By the same logic, the model also contains genes whose mechanisms in hepatocellular carcinoma have not been thoroughly studied, such as AURKB and MTFR2. Both have been reported in hepatocellular carcinoma, but not in depth; for MTFR2, the clinical value and potential mechanism have been evaluated in lung adenocarcinoma [46].

This study produced a DNN model based on 17 genes, many of which play an essential role in hepatocellular carcinoma. The mechanisms of action of some of these genes have been elucidated, while those of others have not. To support clinical treatment decisions, we incorporated these 17 genes into a DNN model that helps distinguish the low-risk subtype from the high-risk subtype. This gene set can also guide researchers toward in-depth exploration of genes related to the formation and development of hepatocellular carcinoma.
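How such a model turns 17 expression values into a subtype call can be sketched as a forward pass through a small feed-forward network. The code below is only an illustration: the architecture (one hidden layer of 8 units), the randomly initialized weights, the 0.5 decision threshold, and the patient expression values are all hypothetical and do not reproduce the trained model from this study.

```python
import math
import random

# The 17 genes discussed in this section, used here as input features.
GENES = [
    "ALDH2", "GYS2", "SLC10A1", "SLC27A5", "SLC22A1", "PKM", "NCAPD2",
    "HJURP", "CKS2", "MELK", "CEP55", "KIF18B", "SAE1", "CDCA3",
    "AURKB", "NUP62", "MTFR2",
]


def relu(x):
    return max(0.0, x)


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def init_layer(n_in, n_out, rng):
    """Random weights stand in for trained ones (assumption)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ x + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict_high_risk_probability(expression, layers):
    """Map a {gene: expression} dict to a probability in [0, 1]."""
    x = [expression[gene] for gene in GENES]
    for weights, biases in layers[:-1]:
        x = dense(x, weights, biases, relu)
    weights, biases = layers[-1]
    return dense(x, weights, biases, sigmoid)[0]


rng = random.Random(0)
layers = [init_layer(len(GENES), 8, rng), init_layer(8, 1, rng)]
# Hypothetical expression values for a single patient.
patient = {gene: rng.uniform(0.0, 10.0) for gene in GENES}
prob = predict_high_risk_probability(patient, layers)
label = "high-risk" if prob >= 0.5 else "low-risk"
print(f"P(high-risk) = {prob:.3f} -> {label}")
```

In a trained model, the weights and biases would be learned from labeled patients rather than drawn at random, and the input expression values would be normalized in the same way as the training data.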

5. Conclusions

In summary, we used machine learning algorithms to screen out 17 survival-related genes in HCC patients and trained a DNN model based on them to predict the survival risk of HCC patients. The genes that make up the model are all key genes that affect the formation and development of cancer. The DNN takes their expression levels as input and stores the learned weight coefficients, which allows the survival risk of an HCC patient to be predicted at the molecular level. Such a model could help guide prognostic assessment and treatment of HCC patients.

Author contributions

Conception: Quan Zi, Hanwei Cui.

Interpretation or analysis of data: Quan Zi, Wei Liang.

Preparation of the manuscript: Quan Zi.

Revision for important intellectual content: Quan Zi, Qingjia Chi.

Supervision: Quan Zi, Qingjia Chi.

Declaration of competing interests

The authors have declared that no competing interests exist.

Funding

This study was supported by grants from the National Natural Science Foundation of China (Grant Numbers: 81970631, 81801639).

Acknowledgments

We thank the TCGA and GEO databases for generously sharing their data.

References

1. Ferlay, Soerjomataram, Dikshit, Eser, Mathers, Rebelo, Parkin D.M., Forman and Bray, Cancer incidence and mortality worldwide: Sources, methods and major patterns in GLOBOCAN 2012, International Journal of Cancer 136 (2015), E359–E386.

2. Dutta and Mahato R.I., Recent advances in hepatocellular carcinoma therapy, Pharmacology & Therapeutics 173 (2017), 106–117.

3. Llovet J.M., Villanueva, Lachenmayer and Finn R.S., Advances in targeted therapies for hepatocellular carcinoma in the genomic era, Nature Reviews Clinical Oncology 12 (2015), 436–436.

4. Issa N.T., Stathias, Schürer and Dakshanamurthy, Machine and Deep Learning Approaches for Cancer Drug Repurposing, Seminars in Cancer Biology (2020).

5. Yang, Wang, Bai and Zhang, A network-based predictive gene expression signature for recurrence risks in stage II colorectal cancer, Cancer Medicine 9 (2019).

6. Xue, Zhan, Zhang, Yuan and Fan, Development and Validation of a 12-Gene Immune Relevant Prognostic Signature for Lung Adenocarcinoma Through Machine Learning Strategies, Frontiers in Oncology 10 (2020), 835.

7. Liu, Duan, Huang, Jin, Niu, Zhang and Chen, Identification of an Immune-Related Prognostic Signature Associated With Immune Infiltration in Melanoma, Frontiers in Genetics 11 (2020), 1002.

8. Jee B.A., Choi J.H., Rhee, Yoon and Park Y.N., Dynamics of Genomic, Epigenomic, and Transcriptomic Aberrations during Stepwise Hepatocarcinogenesis, Cancer Research 79 (2019), canres.

9. Losic, Craig A.J., Villacorta-Martin, Martins-Filho S.N. and Villanueva, Intratumoral heterogeneity and clonal evolution in liver cancer, Nature Communications 11 (2020), 291.

10. Kim K.H. and Roberts, Targeting EZH2 in cancer, Nature Medicine 22 (2016), 128–134.

11. Chen, Zhao, Chen, Tian and Li, CD24 isoform a promotes cell proliferation, migration and invasion and is downregulated by EGR1 in hepatocellular carcinoma, Oncotargets & Therapy 12 (2019), 1705–1716.

12. Jia, Yang, Liu, Herman J.G. and Guo, SOX17 antagonizes WNT/β-catenin signaling pathway in hepatocellular carcinoma, Epigenetics 5 (2010), 743–749.

13. Ritchie M.E., Belinda, Law C.W., Wei and Smyth G.K., Limma powers differential expression analyses for RNA-sequencing and microarray studies, Nucleic Acids Research 43 (2015), e47.

14. Simon, Theodor P.P. and Wolfgang, HTSeq – a Python framework to work with high-throughput sequencing data, Bioinformatics (2015), 166–9.

15. Chen H.Y., C.Y., Wang J.L., Qian, Zhang and Fang J.Y., A long non-coding RNA signature to improve prognosis prediction of colorectal cancer, Oncotarget 5 (2014).

16. Macqueen, Some methods for classification and analysis of multivariate observations, Proc Symp Math Statist and Probability, 5th 1 (1967).

17. Díaz-Uriarte and Andrés S.A.D., Gene selection and classification of microarray data using random forest, BMC Bioinformatics 7 (2006), 3.

18. Friedman J.H., Hastie and Tibshirani, Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software 33 (2010).

19. Subramanian, Tamayo, Mootha V.K., Mukherjee, Ebert B.L., Gillette M.A., Paulovich, Pomeroy S.L., Golub T.R., Lander E.S. and Mesirov J.P., Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences 102 (2005), 15545–15550.

20. Ashburner, Ball C.A., Blake J.A., Botstein, Butler, Cherry J.M., Davis A.P., Dolinski, Dwight S.S. and Eppig J.T., Gene Ontology: Tool for the Unification of Biology, Nature Genetics (2000).

21. Kanehisa and Goto, KEGG: Kyoto Encyclopedia of Genes and Genomes, Nucleic Acids Research (2000).

22. Zhang and Fu, The role of ALDH2 in tumorigenesis and tumor progression: Targeting ALDH2 as a potential cancer treatment, Acta Pharmaceutica Sinica B (2021).

23. Zahid K.R., Yao, Khan, Raza and Gou, mTOR/HDAC1 Crosstalk Mediated Suppression of ADH1A and ALDH2 Links Alcohol Metabolism to Hepatocellular Carcinoma Onset and Progression in silico, Frontiers in Oncology 9 (2007).

24. Chen, Legrand A.J., Cunniffe, Hume, Poletto, Vaz, Ramadan, Yao and Dianov G.L., Interplay between base excision repair protein XRCC1 and ALDH2 predicts overall survival in lung and liver cancer patients, Cellular Oncology (2018).

25. Seo, Gao, Sun, Feng, Park S.H., Cho Y.E., Guillot and Ren, ALDH2 deficiency promotes alcohol-associated liver cancer by activating oncogenic pathways via oxidized DNA-enriched extracellular vesicles, Journal of Hepatology 71 (2019), 1000–1011.

26. Chen S.L., Zhang C.Z., Liu L.L., S.X., Pan Y.H., Wang C.H., Y.F., Lin C.S., Yang and Xie, A GYS2/p53 negative feedback loop restricts tumor growth in HBV-related hepatocellular carcinoma, Cancer Research (2018).

27. Liu G.M., Zeng H.D., Zhang C.Y. and Xu J.W., Identification of a six-gene signature predicting overall survival for hepatocellular carcinoma, Cancer Cell International 19 (2019).

28. Qht, Vgn, Cong and Mnn, Down-regulation of solute carrier family 10 member 1 is associated with early recurrence and poorer prognosis of hepatocellular carcinoma, Heliyon 7 (2021), e06463.

29. Gao, Zhu, Dong, Shi and Fan, Integrated Proteogenomic Characterization of HBV-Related Hepatocellular Carcinoma, Cell 179 (2019), 1240.

30. Zhang, Xue, Jiang, Qiu, Yang and Bao, Identifying SLC27A5 as a potential prognostic marker of hepatocellular carcinoma by weighted gene co-expression network analysis and in vitro assays, Cancer Cell International 21 (2021), 174.

31. Herraez, Lozano, Macias, Vaquero, Bujanda, Banales J.M., Marin and Briz, Expression of SLC22A1 variants may affect the response of hepatocellular carcinoma and cholangiocarcinoma to sorafenib, Hepatology 58 (2013).

32. Hamaguchi, Iizuka, Tsunedomi, Hamamoto and Oka, Glycolysis module activated by hypoxia-inducible factor 1alpha is related to the aggressive phenotype of hepatocellular carcinoma, International Journal of Oncology 33 (2008), 725.

33. Zhao, Zhang, Shao, Sun and Lin, Identification of hub genes and biological pathways in hepatocellular carcinoma by integrated bioinformatics analysis, PeerJ 9 (2021), e10594.

34. Chen, Huang, Yuan, Lei, Shen, Yin, Zhou and Zheng, HJURP promotes hepatocellular carcinoma proliferation by destabilizing p21 via the MAPK/ERK1/2 and AKT/GSK3β signaling pathways, Journal of Experimental & Clinical Cancer Research 37 (2018), 193.

35. Xue, Fang and Gao, High-expressed CKS2 is associated with hepatocellular carcinoma cell proliferation through down-regulating PTEN, Pathology Research & Practice (2018).

36. Xia, Kong S.N., Chen, Shi, Sekar, Seshachalam V.P., Rajasekaran, Goh B.P., Ooi L.L. and Hui K.M., MELK is an oncogenic kinase essential for early hepatocellular carcinoma recurrence, Cancer Letters (2016), 85–93.

37. Thangaraj, Ponnusamy, Natarajan S.R. and Manoharan, MELK/MPK38 in cancer: from mechanistic aspects to therapeutic strategies, Drug Discovery Today (2020).

38. and Yin, CEP55 Promotes Cell Motility via JAK2-STAT3-MMPs Cascade in Hepatocellular Carcinoma, Cells 7 (2018), 99.

39. Yang, Wang, Xie, Wang, Gao, Rong, Liu and Lu, KIF18B promotes hepatocellular carcinoma progression through activating Wnt/β-catenin-signaling pathway, Journal of Cellular Physiology (2020).

40. Ong J.R., Bamodu O.A., Khang N.V., Lin Y.K. and Cherng Y.G., SUMO-Activating Enzyme Subunit 1 (SAE1) Is a Promising Diagnostic Cancer Metabolism Biomarker of Hepatocellular Carcinoma, Cells 10 (2021), 178.

41. Wang, Chen, Wang and Qin, CDCA3 is a novel prognostic biomarker associated with immune infiltration in hepatocellular carcinoma, BioMed Research International 2021 (2021).

42. Tanaka, Arii, Yasen, Mogushi, N.T., Zhao, Imoto, Eishi, Inazawa and Miki, Aurora kinase B is a predictive factor for the aggressive recurrence of hepatocellular carcinoma after curative hepatectomy, British Journal of Surgery 95 (2008), 611–619.

43. Hazawa, De Hen, Kobayashi, Jiang Y.I. and Wong R.W., ROCK-dependent phosphorylation of NUP62 regulates p63 nuclear transport and squamous cell carcinoma proliferation, EMBO Reports 19 (2018), e201744523.

44. Zhou, Hua, Tan, Fan, Cao, Meng, Zhu, Zhao and Guan M.X., Inhibiting neddylation modification alters mitochondrial morphology and reprograms energy metabolism in cancer cells, JCI Insight 4 (2019).

45. Zeng, Guo, Liu and He, FGD1 exhibits oncogenic properties in hepatocellular carcinoma through regulating cell morphology, autophagy and mitochondrial function, Biomedicine & Pharmacotherapy 125 (2020), 110029.

46. Chen, Tang, Han and Ke, Evaluation of clinical value and potential mechanism of MTFR2 in lung adenocarcinoma via bioinformatics, BMC Cancer (2020).