Abstract
Objective
Non-small-cell lung cancer (NSCLC) accounts for >85% of lung cancers, and its incidence is increasing. We explored expression differences between NSCLC and normal cells and predicted potential target sites for detection and diagnosis of NSCLC.
Methods
Three microarray datasets from the Gene Expression Omnibus database were analyzed using GEO2R. Gene Ontology and Kyoto Encyclopedia of Genes and Genomes enrichment analyses were conducted. Then, the String database, Cytoscape, and the MCODE plug-in were used to construct a protein–protein interaction (PPI) network and screen hub genes. Overall and disease-free survival associated with hub genes were analyzed using Kaplan-Meier curves, and the relationship between expression patterns of target genes and tumor grades was analyzed and validated. Gene set enrichment analysis and receiver operating characteristic curves were used to verify enrichment pathways and diagnostic performance of hub genes.
Results
In total, 293 differentially expressed genes were identified and mainly enriched in cell cycle, ECM–receptor interaction, and malaria. In the PPI network, 36 hub genes were identified, of which 6 were found to play significant roles in carcinogenesis of NSCLC: CDC20, ECT2, KIF20A, MKI67, TPX2, and TYMS.
Conclusion
The identified target genes may serve as biomarkers for the detection and diagnosis of NSCLC.
Introduction
Non-small-cell lung cancer (NSCLC) is the most common pathological type of lung cancer, accounting for more than 85% of all lung cancers. 1 Currently, its morbidity and mortality are increasing from year to year.2,3 In China, the incidence of NSCLC is also rising steadily. 4 The occurrence and development of NSCLC are caused by changes in the expression of multiple genes and in various signal transduction pathways.5,6 As a result, the precise mechanism of NSCLC is difficult to elucidate. Importantly, early NSCLC-specific symptoms are not obvious, and there is no effective method for diagnosing NSCLC at an early stage. Therefore, finding novel biomarkers for the diagnosis and prognosis of NSCLC is crucial so that patients can receive appropriate treatment as soon as possible.
In recent years, gene microarrays and bioinformatics analysis have been widely used in cancer studies. For instance, Bi et al. and Xu et al.7,8 used such methods to identify key genes for the diagnosis and treatment of ovarian and bladder cancer. Similarly, bioinformatics approaches have elucidated key roles of several genes in the diagnosis and prognosis of NSCLC, such as cyclin-A2 (CCNA2), centrosomal protein of 55 kDa (CEP55), and neuromedin U (NMU).9,10 This approach depends on an effective combination of statistics and bioinformatics analysis. However, analysis of a single microarray dataset may increase the false-positive rate of the results.
To minimize the drawbacks of false-positive and false-negative results, we used 3 mRNA microarray datasets in this study to identify target genes affecting NSCLC. We also studied the relationship between the target genes and NSCLC. These identified target genes may be useful for detection and diagnosis of NSCLC.
Materials and methods
Ethical approval
This research did not use animal or human tissue and therefore did not require ethical approval or patient consent.
Microarray data
Gene expression profiles (GSE10072, 11 GSE19804, 12 and GSE43458 13 ) were obtained from GEO (http://www.ncbi.nlm.nih.gov/geo), a public functional genomics database containing chip, microarray, and high-throughput gene expression data. 14 The GSE10072 dataset contained 58 NSCLC tissue samples and 49 noncancerous samples; GSE19804 contained 60 NSCLC samples and 60 noncancerous samples; and GSE43458 contained 80 NSCLC samples and 30 noncancerous samples.
Data preprocessing and differential expression analysis
The differentially expressed genes (DEGs) between NSCLC samples and noncancerous samples were screened using GEO2R (http://www.ncbi.nlm.nih.gov/geo/geo2r), an online tool in the GEO database that compares two groups of samples to identify DEGs. Genes with an absolute log2 fold change (|logFC|) >1, corresponding to a fold change >2, and an adjusted P-value <0.01 were considered significantly differentially expressed.
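The thresholding step described above can be illustrated with a minimal Python sketch. It is not the GEO2R implementation; the record type `GeneStat`, the function `select_degs`, and the example values are hypothetical, and the sketch assumes the logFC cutoff applies to the absolute value so that both upregulated and downregulated genes are retained.

```python
from dataclasses import dataclass

@dataclass
class GeneStat:
    symbol: str
    log2_fc: float   # log2 fold change, tumor versus normal
    adj_p: float     # multiple-testing adjusted P-value

def select_degs(rows, lfc_cutoff=1.0, p_cutoff=0.01):
    """Return (upregulated, downregulated) symbols passing both cutoffs."""
    up, down = [], []
    for r in rows:
        # Discard genes that fail either the significance or the effect-size cutoff
        if r.adj_p >= p_cutoff or abs(r.log2_fc) <= lfc_cutoff:
            continue
        (up if r.log2_fc > 0 else down).append(r.symbol)
    return up, down

# Made-up illustrative values, not results from the study
example = [
    GeneStat("GENE_A", 2.3, 0.0001),   # passes, upregulated
    GeneStat("GENE_B", -1.8, 0.004),   # passes, downregulated
    GeneStat("GENE_C", 0.6, 0.0002),   # effect size too small
    GeneStat("GENE_D", 3.1, 0.2),      # not significant
]
up, down = select_degs(example)
```

Applying both cutoffs jointly keeps only genes whose change is both large and statistically supported.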
Enrichment analysis of DEGs
A functional enrichment analysis was performed to examine the enrichment of annotated terms. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses 15 were conducted using DAVID (https://david.abcc.ncifcrf.gov; version 6.7) with a threshold of P < 0.05.
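To make the notion of term enrichment concrete, the following Python sketch computes a plain one-sided hypergeometric P-value for a single annotation term. This is a simplification offered for illustration: DAVID reports a modified Fisher exact P-value (the EASE score), so the numbers would not match DAVID output exactly. The function name and argument names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): probability of seeing at least k annotated genes
    in a list of n genes drawn from a background of N genes,
    K of which carry the annotation."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Toy background of 4 genes, 2 annotated; a list of 2 genes, both annotated
p_toy = hypergeom_enrichment_p(k=2, n=2, K=2, N=4)
```

A term would be called enriched when this probability falls below the chosen threshold (P < 0.05 in the text above).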
Protein–protein interaction network construction and cluster analysis
A protein–protein interaction (PPI) network of DEGs was constructed using the String database (http://string-db.org; version 10.0), and interactions with a combined score >0.4 were considered significant. Subsequently, the results were visualized using Cytoscape, 16 and the most significant module in the PPI network was identified using the MCODE plugin (version 1.5.2). The criteria for selection were as follows: MCODE score >5, degree cut-off = 2, node score cut-off = 0.2, max depth = 100, and k-core = 2.
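Two of the ideas in this step, the combined-score threshold and k-core pruning, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: MCODE additionally weights nodes and expands clusters from seed nodes, which is not reproduced here, and the function names and toy edges are hypothetical.

```python
from collections import defaultdict

def build_network(edges, min_score=0.4):
    """Adjacency sets for interactions whose combined score exceeds min_score."""
    adj = defaultdict(set)
    for a, b, score in edges:
        if score > min_score and a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def k_core(adj, k=2):
    """Repeatedly drop nodes with fewer than k neighbours; return the survivors."""
    alive = {n: set(nb) for n, nb in adj.items()}
    changed = True
    while changed:
        changed = False
        for node in [n for n, nb in alive.items() if len(nb) < k]:
            for nb in alive.pop(node):
                if nb in alive:
                    alive[nb].discard(node)
            changed = True
    return set(alive)

# Toy interactions (protein, protein, combined score); not data from the study
toy_edges = [
    ("A", "B", 0.9), ("B", "C", 0.9), ("A", "C", 0.9),  # a triangle
    ("D", "A", 0.7),                                     # D hangs off the triangle
    ("E", "D", 0.3),                                     # below the score cutoff
]
network = build_network(toy_edges)
core = k_core(network, k=2)
```

In the toy network, the low-scoring edge is dropped at construction, and the singly connected node is pruned because it cannot belong to a 2-core.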
Target genes scan and analysis
The genes of the module and their co-expressed genes were analyzed using the cBioPortal (http://www.cbioportal.org) online platform.17,18 The biological process analysis of the genes was performed and visualized using the Biological Networks Gene Ontology tool (BiNGO; version 3.0.3) plugin of Cytoscape. 19 Hierarchical clustering of hub genes was constructed using the University of California Santa Cruz (UCSC) Cancer Genomics Browser (https://xenabrowser.net/heatmap/). 20 The overall survival and disease-free survival analyses of the genes were performed using Kaplan-Meier curves in cBioPortal. Furthermore, the survival and receiver operating characteristic (ROC) analyses of hub genes in the TCGA Lung Adenocarcinoma (LUAD) dataset were conducted and visualized using R (www.r-project.org). Gene set enrichment analysis (GSEA) was conducted using GSEA tools (http://www.broad.mit.edu/gsea). The expression profiles of CDC20, ECT2, KIF20A, MKI67, TPX2, and TYMS were analyzed and displayed using the Oncomine database (http://www.oncomine.com). The relationships between expression patterns and tumor grades were also analyzed using Oncomine.21–23
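The Kaplan-Meier curves used for the survival analyses rest on the product-limit estimator. The Python sketch below shows that estimator on toy data; it is not the cBioPortal or R code used in the study, and the function name and example follow-up times are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up durations
    events : 1 if the event (e.g. death) was observed, 0 if censored
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = censored = 0
        # Group all subjects sharing the same follow-up time
        while i < len(pairs) and pairs[i][0] == t:
            if pairs[i][1]:
                deaths += 1
            else:
                censored += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        # Both events and censored subjects leave the risk set
        at_risk -= deaths + censored
    return curve

# Toy follow-up data (months, event flag); not data from the study
km_curve = kaplan_meier([5, 8, 8, 12, 15], [1, 1, 0, 1, 0])
```

Censored subjects reduce the number at risk without lowering the survival estimate, which is why the curve steps down only at observed events.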
Results
Identification of DEGs in NSCLC
After standardization of the microarray results, DEGs (6,656 in GSE10072, 1,404 in GSE19804, and 895 in GSE43458) were identified. The overlap among the 3 datasets contained 293 genes, as shown in the Venn diagram (Figure 1a), consisting of 167 downregulated genes and 126 upregulated genes between NSCLC and noncancerous tissues.
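The overlap summarized in the Venn diagram amounts to a set intersection across the three DEG lists. The Python sketch below illustrates this, with the added assumption (not stated in the text) that a gene is counted as upregulated or downregulated only when its direction agrees in every dataset. The function name and gene symbols are hypothetical.

```python
def overlap_by_direction(*datasets):
    """datasets: dicts mapping gene symbol -> 'up' or 'down' for one DEG list.

    Returns (shared, up, down): genes present in every list, and the subsets
    whose direction of change agrees across all lists.
    """
    shared = set.intersection(*(set(d) for d in datasets))
    up = {g for g in shared if all(d[g] == "up" for d in datasets)}
    down = {g for g in shared if all(d[g] == "down" for d in datasets)}
    return shared, up, down

# Toy DEG lists; symbols and directions are made up for illustration
set_1 = {"G1": "up", "G2": "down", "G3": "up", "G4": "down"}
set_2 = {"G1": "up", "G2": "down", "G3": "down"}
set_3 = {"G1": "up", "G2": "down", "G3": "up", "G5": "up"}
shared, up_all, down_all = overlap_by_direction(set_1, set_2, set_3)
```

Requiring agreement across independent datasets is what reduces the chance that a gene is retained because of noise in any single experiment.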

Figure 1. Analysis and screening of DEGs in the PPI network. (a) DEGs with a fold change >2 and P-value < 0.01 were selected from among the mRNA expression profiling sets GSE10072, GSE19804, and GSE43458. The 3 datasets showed an overlap of 293 genes. (b) GO and (c) KEGG analysis of DEGs. (d) The PPI network of DEGs was constructed using Cytoscape. (e) The most significant module was obtained using MCODE. DEG, differentially expressed gene; PPI, protein–protein interaction; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes.
GO and KEGG enrichment analysis of DEGs
The functional and pathway enrichment analyses of DEGs were carried out using DAVID. The GO analysis results showed that changes in molecular function (MF) of DEGs were significantly enriched in extracellular matrix (ECM) structural constituent, metalloendopeptidase activity, cargo receptor activity, metallopeptidase activity, and glycosaminoglycan binding. Changes in biological processes (BP) were mainly enriched in extracellular structure organization, ECM organization, mitotic nuclear division, cell-substrate adhesion, and nuclear division. Changes in cellular component (CC) of DEGs were mainly enriched in the ECM, collagen-containing ECM, spindle, condensed chromosome outer kinetochore, and spindle pole (Figure 1b). KEGG pathway analysis revealed that the DEGs were mainly enriched in cell cycle, ECM–receptor interaction, and malaria (Figure 1c).
Construction and module analysis of PPI Network
The PPI network of DEGs is shown in Figure 1d, and the most significant module is shown in Figure 1e. The abbreviations and full names of the genes in this module are shown in Table 1. Functional analysis showed that the 36 genes in this module were mainly enriched in cell division, mitotic nuclear division, and cell cycle (Table 2). A network of these genes and their co-expressed genes is shown in Figure 2a. The most significant biological processes of these genes are shown in Figure 2b. Subsequently, hierarchical clustering showed that the hub genes could differentiate the NSCLC samples from the noncancerous samples (Figure 2c).
Table 1. The list of genes involved in the most significant module.
Table 2. Enrichment analysis of DEGs in NSCLC.
DEG, differentially expressed gene; NSCLC, non-small-cell lung cancer; FDR, false discovery rate.

Figure 2. Interaction network and biological process analysis of the module genes. (a) Module genes and their co-expressed genes were analyzed using cBioPortal. Nodes outlined in bold black are hub genes; nodes outlined in thin black are co-expressed genes. (b) The most significant biological processes of module genes were analyzed using BiNGO. (c) Hierarchical clustering of module genes was constructed using the UCSC Cancer Genomics Browser (https://xenabrowser.net/heatmap/). Upregulation of genes is marked in red; downregulation of genes is marked in green. Gene symbols shown in red are the six hub genes found to play a significant role in carcinogenesis; gene symbols in black are hub genes identified in the protein–protein interaction network.
Analysis of potential biomarkers for NSCLC
NSCLC patients with alterations in CDC20, ECT2, MKI67, TPX2, and TYMS showed worse overall survival (Figure 3a), and those with alterations in KIF20A, MKI67, and TPX2 showed worse disease-free survival (Figure 3b). These genes were therefore considered potential NSCLC biomarkers. Oncomine analysis of cancerous versus normal tissue showed that these genes were significantly overexpressed in NSCLC across different datasets (Figure 4a and b). Moreover, higher mRNA expression of these genes was associated with tumor stage in the Oncomine lung datasets (Figure 5). To confirm these results, we validated the genes using the TCGA database (Figure 6) and performed GSEA on the TCGA data. The gene sets with the highest enrichment scores were all closely associated with the cell cycle (Figure 7). In addition, ROC curves showed that all these genes could distinguish tumors from normal lung tissue with high sensitivity and accuracy (Figure 8). All these genes therefore appear to be promising candidate biomarkers and therapeutic targets.
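The diagnostic performance summarized by an ROC curve is commonly reported as the area under the curve (AUC), which equals the probability that a randomly chosen tumor sample has a higher expression value than a randomly chosen normal sample. The Python sketch below computes the AUC in this pairwise form; it is not the R code used in the study, and the function name and expression values are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (tumor, normal) pairs in which the tumor sample
    has the higher score; ties count as half a win.

    scores : expression values of one gene
    labels : 1 for tumor, 0 for normal
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy expression values; not data from the study
toy_scores = [8.1, 7.4, 6.9, 5.2, 6.9, 4.8]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
```

An AUC near 1 indicates that the gene separates tumor from normal tissue almost perfectly, whereas 0.5 corresponds to no discrimination.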

Figure 3. Survival analysis of potential NSCLC biomarkers. (a) Overall survival and (b) disease-free survival analyses of module genes were performed using the cBioPortal online platform. P < 0.05 was considered statistically significant. NSCLC, non-small-cell lung cancer.

Figure 4. Oncomine analysis of potential NSCLC biomarkers in NSCLC samples and noncancerous samples. (a) Heat maps of potential NSCLC biomarker expression in clinical lung carcinoma samples versus normal tissues. 1 = Lung Adenocarcinoma vs. Normal Landi Lung, PLoS ONE, 2008. 2 = Lung Adenocarcinoma vs. Normal Okayama Lung, Cancer Res, 2012. 3 = Lung Adenocarcinoma vs. Normal Selamat Lung, Genome Res, 2012. 4 = Lung Adenocarcinoma vs. Normal Su Lung, BMC Genomics, 2007. (b) mRNA expression in NSCLC compared with normal lung tissues. Lower and upper circles indicate the minimum and maximum values, whiskers indicate the 10th and 90th percentiles, and the box indicates the 25th and 75th percentiles, respectively; the line indicates the median. NSCLC, non-small-cell lung cancer.

Figure 5. Association between the expression of potential NSCLC biomarkers and tumor stage. NSCLC, non-small-cell lung cancer.

Figure 6. Survival analysis of potential NSCLC biomarkers using the TCGA database. Analyses of CDC20 (a), ECT2 (b), KIF20A (c), MKI67 (d), TPX2 (e), and TYMS (f) were carried out. NSCLC, non-small-cell lung cancer.

Figure 7. Gene set enrichment analysis of potential NSCLC biomarkers using the TCGA database. Analyses of CDC20 (a), ECT2 (b), KIF20A (c), MKI67 (d), TPX2 (e), and TYMS (f) were carried out. NSCLC, non-small-cell lung cancer.

Figure 8. Receiver operating characteristic curve analysis of potential NSCLC biomarkers using the TCGA database. Analyses of CDC20 (a), ECT2 (b), KIF20A (c), MKI67 (d), TPX2 (e), and TYMS (f) were carried out. NSCLC, non-small-cell lung cancer.
Discussion
Biomarkers for diagnosing or treating cancer are often obtained by identifying the most important DEGs in microarray or high-throughput case-control studies. 7 As with other cancers, the development, progression, and metastasis of lung cancer are complex processes involving aberrations in multiple genes and cellular pathways. 24 The DEGs between NSCLC and normal tissue may be the core functional genes that promote the occurrence and development of NSCLC.25,26 To improve the diagnosis and treatment of NSCLC, it is important to identify these DEGs and understand their role in the molecular mechanisms of NSCLC.
In the present study, 293 DEGs were identified between NSCLC and noncancerous samples through analysis of three datasets. Among these DEGs, we selected 6 that are closely related to the occurrence and development of NSCLC: CDC20, ECT2, KIF20A, MKI67, TPX2, and TYMS. In the overall survival and disease-free survival analyses of the target genes, we found that poor prognosis of NSCLC patients was associated with high expression of these genes. Kato et al. 27 reported that CDC20 was overexpressed in NSCLC, and that overexpression predicts poor prognosis. Bai et al. 28 showed that the overexpression of ECT2 could promote the occurrence and development of NSCLC, suggesting that ECT2 could be used as a diagnostic marker. Ni et al., 29 using bioinformatics analysis, showed that KIF20A was correlated with the pathogenesis and prognosis of NSCLC. Schneider et al. 30 demonstrated that overexpression of TPX2 mRNA in tumor cells is associated with the prognosis of NSCLC patients. Sun et al. 31 showed that mRNA expression of TYMS may have prognostic value for patients with NSCLC treated with platinum-based chemotherapy. These previous studies are consistent with our results and support the effectiveness of bioinformatics screening for identifying target genes. However, we found no reports associating MKI67 with NSCLC. Therefore, the role of MKI67 in NSCLC needs further experimental confirmation.
In our study, we identified 36 hub genes. Hub genes are involved in many biological processes and signal transduction pathways. Therefore, analyzing the biological functions and signaling pathways related to hub genes can help reveal the mechanisms underlying the occurrence and development of NSCLC. GO enrichment analysis revealed that the DEGs were mainly enriched in extracellular structure organization, ECM organization, mitotic nuclear division, cell-substrate adhesion, and nuclear division, whereas KEGG analysis showed enrichment mainly in cell cycle, ECM–receptor interaction, and malaria. Previous studies have reported that dysregulation of the cell cycle plays an important role in the carcinogenesis or progression of tumors.32,33 CDC20 can act as a regulatory protein that interacts with other proteins to participate in the cell cycle of tumors. 34 CDC20 has also been shown to be involved in tumor formation by regulating the ECM–receptor interaction pathway. 35 These studies are consistent with our findings on CDC20. However, the roles of these genes and pathways in NSCLC require further investigation.
In conclusion, our objective was to find new biomarkers related to the diagnosis and prognosis of NSCLC. A total of 293 DEGs and 36 hub genes were identified, and screening yielded 6 target genes closely related to NSCLC. These bioinformatics analyses provide a new perspective for understanding the occurrence and development of NSCLC and may inform its treatment. However, the results still need to be rigorously evaluated before they can be applied clinically.
Footnotes
Acknowledgements
Special thanks go to Xiu-juan Chen (Reproductive Medicine Center, Affiliated Hospital of Inner Mongolia Medical University, Hohhot, P. R. China), Ri-na Sha (Pathology Department, Affiliated Hospital of Inner Mongolia Medical University, Hohhot, P. R. China), Jian-long Yuan (Clinical Laboratory, Affiliated Hospital of Inner Mongolia Medical University, Hohhot, P. R. China), and Jie Zhao (Reproductive Medicine Center, Affiliated Hospital of Inner Mongolia Medical University, Hohhot, P. R. China) for their suggestions on the manuscript.
Author contributions
Dong-jun Liu designed and supervised the research. Bai Dai performed statistical analyses and wrote the manuscript. Li-qing Ren and Xiao-yu Han did the practical work and revised the manuscript.
Declaration of conflicting interest
The authors declare that there is no conflict of interest.
Funding
This work was supported by the Science and Technology Innovation Guided Project in Inner Mongolia Autonomous Region (KCBJ2018003).
