Abstract
Objective
The objective was to identify potential hub genes associated with the pathogenesis and prognosis of hepatocellular carcinoma (HCC).
Methods
Gene expression profile datasets were downloaded from the Gene Expression Omnibus database. Differentially expressed genes (DEGs) between HCC and normal samples were identified via an integrated analysis. A protein–protein interaction network was constructed and analyzed using the STRING database and Cytoscape software, and enrichment analyses were carried out through DAVID. Gene Expression Profiling Interactive Analysis and Kaplan–Meier plotter were used to determine expression and prognostic values of hub genes.
Results
We identified 11 hub genes (
Conclusions
In this study, we identified key genes of HCC, which indicated directions for further research into diagnostic and prognostic biomarkers that could facilitate targeted molecular therapy for HCC.
Keywords
Introduction
On a global scale, cancer is the main public health problem and liver cancer is a major contributor to both cancer morbidity and mortality. 1 Liver cancer is the sixth most common cancer and the fourth highest cause of cancer-related mortality worldwide. 2 There were expected to be 42,030 newly diagnosed cases and 31,780 deaths of liver cancer in the United States during 2019. 3 Hepatocellular carcinoma (HCC) is the most common form of primary liver cancer, comprising 75% to 85% of cases. 2 The well-recognized risk factors for HCC include chronic infection with hepatitis B (HBV) or hepatitis C virus, exposure to dietary aflatoxin, alcohol-induced cirrhosis, smoking, obesity, and type 2 diabetes.2,4 In Asia (especially China), chronic HBV infection is the leading etiologic factor of HCC. 5 Most HCC patients are diagnosed at an advanced stage, and locoregional treatments (chemoembolization) and surgical treatments are relatively disappointing in terms of overall survival (OS) of patients with advanced disease. 6 In addition, traditional chemotherapies have not shown promising outcomes in treatment of HCC and have significant toxicity.6,7 Meanwhile, the lack of early detection of diagnostic markers and limited treatment strategies increase the risk of poor prognosis and death. 8 Therefore, there is a pressing need to develop robust diagnostic strategies and effective therapies for HCC patients. 9
Over the past decades, microarray technology and bioinformatics have been extensively applied to identify the molecular mechanisms of HCC, which provide strong research support for the diagnosis, treatment, and prognosis of HCC. 10 Because of the ability to process a large number of datasets quickly, integrated bioinformatics analysis and microarray technology have allowed researchers to comprehensively identify the functions of numerous differentially expressed genes (DEGs) in HCC, and they help researchers explore the complicated process of HCC occurrence and development.10,11 A work by He et al. 12 identified four hub genes and two important pathways in the development of HCC from cirrhosis from one Gene Expression Omnibus (GEO) dataset using a bioinformatics method, including DEG screening, enrichment analyses, and construction of a protein–protein interaction (PPI) network. Zhang et al. 13 screened hub genes and pathways correlated with the occurrence and progression of HCC via a series of bioinformatics analyses incorporating DEGs identification, functional enrichment analyses, PPI network and module analysis, and weighted correlation network analysis. Zhou et al. 11 identified the pivotal genes and microRNAs in HCC using a bioinformatics approach, including analysis of raw data via GEO2R, Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses, and construction of PPI network. However, to improve the diagnosis and treatment of HCC, novel diagnostic and prognostic biomarkers for HCC are needed. The flowchart of the study approach is shown in Figure 1.

Flowchart for identification of core genes and pathways for hepatocellular carcinoma (HCC). GEO, Gene Expression Omnibus; DEG, differentially expressed gene; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; PPI, protein–protein interaction; GEPIA, Gene Expression Profiling Interactive Analysis.
Materials and methods
Ethical approval
Ethical approval was not required in this study because we analyzed only published data from the GEO database.
Gene expression profile data
Gene expression profile data (GSE36376, 14 GSE39791, 15 GSE41804, 16 GSE54236,17,18 GSE57957, 19 GSE62232, 20 GSE64041, 21 GSE69715, 22 GSE76427, 23 GSE84005, GSE87630, 24 GSE112790, 25 and GSE121248 26 ) were downloaded from the GEO database (http://www.ncbi.nlm.nih.gov/geo/), 27 a public data repository, including high-throughput gene expression and other functional genome datasets. The selection criteria for the included datasets were as follows: (1) tissue samples collected from human HCC and corresponding adjacent or normal tissues; and (2) including at least 40 samples.
Integrated analysis of microarray datasets
The matrix data of each GEO dataset were normalized and log2 transformed using the R software package limma,
28
and the DEGs between HCC and corresponding adjacent or normal tissues were also filtered using the limma package. Integration of DEGs identified from the 13 datasets was performed by RobustRankAggreg package
29
in R software. A |log2 fold change (FC)| ≥1 and adjusted
Enrichment analyses of DEGs
Database for Annotation, Visualization and Integrated Discovery (DAVID; https://david.ncifcrf.gov/, version, 6.8) 30 is a comprehensive functional annotation tool for extracting biological significance from large gene/protein datasets. In this study, the GO and KEGG pathway enrichment analyses for the DEGs were conducted via DAVID. The visualization of enrichment analysis results was conducted by using ggplot2 31 and the GOplot 32 package in the R software.
PPI network and module analysis
Search Tool for the Retrieval of Interacting Genes/Proteins (STRING; https://string-db.org/) 33 is a database of known and predicted protein interactions, showing direct and indirect interactions among proteins. This database was applied to obtain potential interactions among the DEGs. PPIs with a confidence score ≥0.7 were reserved and imported into Cytoscape software 34 to construct the PPI network. Furthermore, the clustering modules in this PPI network were analyzed using the MCODE (Molecular Complex Detection) plugin in Cytoscape. 35 Pathway enrichment analyses for important modules were also carried out. The visualization of enrichment analysis results was performed by using the imageGP platform (http://www.ehbio.com/ImageGP/index.php/Home/Index/GOenrichmentplot.html).
Survival analysis of hub genes
Kaplan–Meier plotter (KM plotter; http://kmplot.com/analysis/) is a database containing clinical data and gene expression data.
36
This database is used to further understanding the molecular basis of disease and identifying biomarkers associated with survival.
37
The recurrence-free survival and OS information were based on GEO, the European Genome-phenome Archive (EGA), and The Cancer Genome Atlas (TCGA) databases. Hazard ratios (HR) with 95% confidence intervals and log rank
Expression level analysis and correlation analysis of hub genes
The Gene Expression Profiling Interactive Analysis (GEPIA; http://gepia.cancer pku.cn/index.html) 39 is a newly developed web-based tool that applies a standard processing pipeline to analyze gene expression data between tumor and normal tissues. The relationship of expression of hub genes in HCC and normal tissues were visualized by boxplot. 38 In addition, correlation analysis was performed by GEPIA to check the relative ratios between two genes. 39
Results
Identification of DEGs
In the present study, 13 datasets were downloaded from GEO that included 1100 cancer tissues and 717 corresponding adjacent or normal tissues (Table 1). After integrated analysis, 380 DEGs (293 downregulated and 87 upregulated) were identified (Figure 2a-m and Appendix). Figure 2n shows the top 20 down- and upregulated genes.
Information for the 13 Gene Expression Omnibus datasets included in the current study.

Identification of DEGs. Volcano plots of Gene Expression Omnibus datasets (a) GSE36376, (b) GSE39791, (c) GSE41804, (d) GSE54236, (e) GSE57957, (f) GSE62232, (g) GSE64041, (h) GSE69715, (i) GSE76427, (j) GSE84005, (k) GSE87630, (l) GSE112790, and (m) GSE121248; (n) heat map of DEGs. Blue indicates lower expression levels, red indicates higher expression levels, and white indicates no differentially expression among the genes. Each column represents one dataset and each row represents one gene. The number in each rectangle represents the normalized gene expression level. The gradual color ranged from blue to red represents the changing process from downregulation to upregulation. DEG, differentially expressed gene.
GO and KEGG pathway enrichment analyses of DEGs
To deepen our understanding of DEGs, we performed GO and KEGG pathway enrichment analyses. Thirty-one significantly enriched GO terms were selected based on a false discovery rate (FDR) < 0.05 (Figure 3a and Appendix). In the GO terms were 13 terms for biological process, mainly related to metabolic process, P450 pathway, and oxidation-reduction process; 12 terms for molecular function, highly involved with multiple enzyme activities, heme binding, iron ion binding and oxygen binding; and 6 terms for cellular components, associated with organelle membrane, extracellular exosome, extracellular region, extracellular space, blood microparticle, and membrane attack complex.

Enrichment analysis of DEGs. (a) GO enrichment analysis of DEGs, (b) top 5 terms of GO enrichment, and (c) KEGG pathway analysis of DEGs. DEG, differentially expressed gene; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; FDR, false discovery rate.
In the KEGG pathway enrichment analyses, we screened nine pathways according to FDR < 0.05. Figure 3c shows the results of KEGG analysis; the DEGs primarily participated in diverse metabolism-associated signaling pathways, such as metabolic pathways, retinol metabolism, drug metabolism-cytochrome P450, among others.
PPI network establishment and module analysis
The PPI network of DEGs comprised 242 nodes and 1267 interactions (Figure 4a); degree was calculated to identify candidate key nodes. Finally, 11 potential key nodes were identified, the degrees of which were all more than four times the corresponding average values:

PPI network and hub clustering modules. (a) The PPI network of DEGs, (b) module 1 (MCODE score = 38.769), and (c) module 2 (MCODE score = 10.364). Blue circles represent downregulated genes and red circles represent upregulated genes. PPI, protein–protein interaction; DEG, differentially expressed gene; MCODE, Molecular Complex Detection.
Upregulated hub genes with high degrees.

Pathway analysis of the two modules with the highest scores. The y-axis shows significantly enriched KEGG pathways, and x-axis shows the two modules. KEGG, Kyoto Encyclopedia of Genes and Genomes; FDR, false discovery rate.
Survival analysis, expression, and correlation analysis of hub genes
Survival analysis of 11 hub genes was performed using the KM plotter. The results showed that high expression of

Prognostic roles of 11 hub genes in patients with HCC shown as survival curves. (a)

Analysis of expression levels of 11 hub genes in human HCC. The red and gray boxes represent cancer and normal tissues, respectively. (a)

Correlation analysis of 10 hub genes in HCC with
Discussion
HCC is the most common type of malignancy and one of the leading causes of cancer-related mortality worldwide.40,41 Although much research has been done on HCC, its early diagnosis and treatment remains difficult because of a lack of understanding of the molecular mechanisms associated with HCC occurrence and development.
41
Therefore, in-depth studies of the etiological factors and molecular mechanisms of HCC are of critical importance for HCC diagnosis and treatment.
11
Currently, bioinformatics analysis and microarray technology are developing rapidly and this approach can be used to identify therapeutic targets for diagnosis, therapy, and prognosis of a variety of neoplasms.
42
In this research, we identified 380 DEGs, including 293 downregulated and 87 upregulated genes, between HCC and corresponding adjacent or normal tissues. Enrichment analyses indicated that the DEGs were mostly associated with metabolic processes, such as metabolism of retinol, drugs, xenobiotics, tyrosine, tryptophan, and histidine, as well as fatty acid degradation. This indicated that metabolic dysregulation is closely related to HCC. In addition, we obtained 11 hub genes (
Recent evidence implies that tumor cells need specific interphase cyclin-dependent kinases (CDKs) to proliferate.
43
Cyclin-dependent kinase 1 (CDK1) belongs to the CDK family, a member of the serine/threonine protein kinases, and it is crucial for the cell cycle phase transitions G1/S and G2/M.44,45 CDK1 is required for cell proliferation because it is the only CDK that can initiate mitosis.
46
The deregulation of CDK1 is likely related to HCC tumorigenesis.
47
Research has found that high expression of CDK1 is correlated with poor OS of HCC.
45
Cyclins act as the regulatory subunits of the CDKs, regulating temporal transitions among various stages of the cell cycle via CDK activation.
48
Cyclin-A2 (CCNA2), cyclin-B1 (CCNB1), and cyclin-B2 (CCNB1), encoded by the
Aurora kinase A (encoded by
According to enrichment analyses of two modules, we found that module 1 was mostly associated with cell cycle and module 2 was closely related to metabolic pathways. Furthermore, all 11 hub genes belonged to module 1 and most are associated with cell cycle and enriched in the “cell cycle” pathway. A number of studies have elucidated that cell cycle disorders are closely related to human cancer. 43 Therefore, the carcinogenesis and progression of HCC may be associated with the cell cycle pathway, and we might be able to suppress HCC cell cycle progression, inhibit HCC cell proliferation, and reduce HCC malignancy by downregulating expression of the 11 hub genes identified herein.
Compared with previous studies, this work has several advantages, as follows. First, the current integrated microarray data used a relatively large sample size from several GEO datasets (GSE36376, 14 GSE39791, 15 GSE41804, 16 GSE54236,17,18 GSE57957, 19 GSE62232, 20 GSE64041, 21 GSE69715, 22 GSE76427, 23 GSE84005, GSE87630, 24 GSE112790, 25 and GSE121248 26 ). Second, functional enrichment analyses were performed to identify the main biological functions and pathways regulated by DEGs. Third, we established PPI networks, conducted module analysis, discovered potential biomarkers for diagnosis and prognosis of HCC, and performed correlation analysis of hub genes.
The limitations of this work were as follows: First, our results need to be verified by corresponding experimental studies. Second, we obtained data from the GEO database, and data quality cannot be verified. Finally, our study focused on genes that are typically identified as significant changes in diverse datasets, without regard to sex, age, or grading and staging of tumors from which the samples were derived.
Conclusion
In the present work, we identified 11 hub genes (
Footnotes
Declaration of conflicting interest
The authors declare that there is no conflict of interest.
Funding
This work was supported by the National Natural Science Foundation of China (No. 81473547, 81673829) and the Young Scientists Training Program of Beijing University of Chinese Medicine.
Appendix
Information for 293 downregulated genes (down) and 87 upregulated genes (up).
Information on Gene Ontology (GO) enrichment analysis in biological process (BP), cellular component (CC), and molecular function (MF) categories.
