Introduction
Lung cancer is the malignant tumor with the highest morbidity and mortality worldwide.1 Early detection and diagnosis reduce the mortality of patients with lung cancer.2 In practice, however, about 60% of patients with non-small cell lung cancer are already at an advanced stage at the time of diagnosis. Diagnosing lung cancer requires comprehensive, multidisciplinary judgment, including histological diagnosis, complete staging examination, and comprehensive evaluation.3 Traditional diagnosis is nevertheless subjective and therefore prone to disagreement and misdiagnosis. With the continuous development of medical big data, a growing number of studies apply machine learning to early tumor screening, risk factor analysis, and classification.4,5 At present, machine learning algorithms are used effectively in the diagnosis of lung cancer.6,7 However, early warning models for lung cancer are still lacking in clinical practice, and the existing models cannot meet the clinical need for prognosis evaluation. Constructing an early warning model of lung cancer through machine learning therefore has potential clinical value.
Cancer arises from the long-term accumulation of a large number of gene mutations in somatic cells, which confer advantages for malignant transformation.8 Somatic mutations not only cause tumors to occur but also affect tumor development, including tumor subtype, metastasis, drug resistance, and the immune microenvironment.9 Lung cancer is characterized by extensive genomic instability. Studies have shown that low genomic instability is associated with better survival in patients with lung adenocarcinoma (LUAD), suggesting that it may be practical to construct a survival prediction or risk assessment model of lung cancer based on somatic mutations.10,11
Smoking is the main risk factor for the development of lung cancer.12 In China, tobacco use has accelerated the prevalence of lung cancer, and about three-quarters of male lung cancer deaths can be attributed to smoking.13 In addition, the epidemiology, histological types, and prognosis of lung cancer show strong gender differences.14 There is convincing evidence that the risk, morbidity, and mortality of lung cancer are higher in women who have never smoked than in men who have never smoked.15 This suggests that smoking and gender may be important indicators affecting the diagnosis and prognosis of lung cancer.
Cyclin-dependent kinase inhibitor 2A (CDKN2A) is a tumor suppressor gene that is easily inactivated in cancer. CDKN2A has been found to be a prognostic marker or a transcriptome marker for treatment decisions in hepatocellular carcinoma, colorectal cancer, bladder cancer, and other cancers.16–18 In addition, studies have shown that the absence of CDKN2A indicates a poor prognosis of lung cancer and promotes its development.19 CDKN2A therefore merits inclusion as an indicator for the diagnosis and prognosis of lung cancer.
Therefore, this study aims to construct and validate an early warning model for lung cancer by combining somatic mutation, CDKN2A, smoking, and gender indicators through machine learning.
Materials and Methods
Data
In this study, LUAD somatic mutation data and the corresponding clinical information (567 cases in total) were downloaded from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/). These 567 cases were all of the LUAD cases available in the TCGA database. LUAD transcriptome expression data were also downloaded from TCGA, and data were extracted for 57 tumor samples and the 57 corresponding normal tissue samples from 57 patients; these were all of the patients among the 567 cases for whom a matched control sample was available. Because gender and smoking history were used in model construction, samples with unknown gender or smoking history were excluded.
Somatic Mutation Analysis
Random forests are known for their high performance and generalizability.20 Somatic mutation indicators were screened using a random forest algorithm (R package “randomForestSRC”, with variable relative importance > 0.4). Using the outcome recorded in the clinical information (discharge or death) as the response, the 30 genes most important for the outcome were obtained, and the top 5 genes ranked by the Gini index were taken as the somatic mutation indicators.
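The idea of ranking genes by a Gini-based importance can be illustrated with a minimal Python sketch. It is not the randomForestSRC procedure used in this study: it scores each gene by the Gini impurity decrease of a single mutated versus wild-type split, whereas a random forest averages such decreases over many trees. The gene names and the toy data are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(mutated, outcome):
    """Impurity decrease when samples are split into mutated (1) and wild-type (0)."""
    left = [o for m, o in zip(mutated, outcome) if m == 1]
    right = [o for m, o in zip(mutated, outcome) if m == 0]
    n = len(outcome)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(outcome) - weighted


def top_genes(mutation_matrix, outcome, k=5):
    """Rank genes by impurity decrease and return the k highest-scoring names."""
    scores = {g: gini_decrease(v, outcome) for g, v in mutation_matrix.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: 6 samples, outcome 1 = death, 0 = alive.
outcome = [1, 1, 1, 0, 0, 0]
toy = {
    "GENE_A": [1, 1, 1, 0, 0, 0],  # separates the outcomes perfectly
    "GENE_B": [1, 0, 1, 0, 1, 0],  # weakly informative
    "GENE_C": [0, 0, 0, 0, 0, 0],  # never mutated, uninformative
}
ranking = top_genes(toy, outcome, k=2)
print(ranking)
```

In this toy setting the perfectly separating gene ranks first and the never-mutated gene scores zero, which mirrors why only the highest-ranked genes are carried forward as indicators.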
Differential Expression Analysis
The R package edgeR was used to analyze the differential expression of CDKN2A in the 114 samples. Log2FC > 1 was regarded as CDKN2A upregulation, log2FC < −1 was regarded as CDKN2A downregulation, and all other values were regarded as normal. CDKN2A expression was used as the transcriptome expression indicator.
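The thresholding rule above can be written as a short Python sketch. The function name and labels are illustrative assumptions; the log2FC values themselves would come from edgeR, which is not reproduced here.

```python
def classify_log2fc(log2fc, threshold=1.0):
    """Label a log2 fold change as 'up', 'down', or 'normal'.

    Values strictly above +threshold are upregulated, values strictly
    below -threshold are downregulated, and everything else is normal.
    """
    if log2fc > threshold:
        return "up"
    if log2fc < -threshold:
        return "down"
    return "normal"


# Hypothetical log2FC values, for illustration only.
labels = [classify_log2fc(x) for x in (1.5, -2.0, 1.0, 0.3)]
print(labels)
```

Because both comparisons are strict, a log2FC of exactly 1 or −1 is treated as normal, which follows the inequalities as stated in the text.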
Construction and Validation of a Prognostic Model
A Cox proportional hazards model was constructed from the gender and smoking indicators in the clinical data, together with the somatic mutation indicators and the transcriptome expression indicator. The risk score of each sample was calculated using the predict function, and the median of all sample risk scores was used as the cutoff value to divide the samples into high-risk and low-risk groups for survival analysis. The risk score was obtained as follows: risk score = (0.225946 × S1PR1) + (−0.136905 × DOCK7) + (0.622192 × DDX4) + (0.008847 × LAMB3) + (0.117005 × IPO5) + (0.390656 × CDKN2A). The risk prediction model was presented as a nomogram, and its predictive performance was evaluated using receiver operating characteristic (ROC) curves. The area under the curve (AUC) was calculated to verify reliability, and the accuracy, sensitivity, and specificity were calculated as previously described.21
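As a worked illustration of the risk score and the median cutoff, the following Python sketch applies the six coefficients reported above to a few hypothetical samples. The numeric encoding of each indicator (for example, 1 for mutated and 0 for wild-type) and the handling of scores exactly equal to the median are assumptions, since neither is specified in the text.

```python
from statistics import median

# Coefficients as reported in the risk score formula.
COEFFICIENTS = {
    "S1PR1": 0.225946,
    "DOCK7": -0.136905,
    "DDX4": 0.622192,
    "LAMB3": 0.008847,
    "IPO5": 0.117005,
    "CDKN2A": 0.390656,
}


def risk_score(sample):
    """Linear predictor: sum of coefficient times indicator value.

    Indicators missing from the sample are treated as 0 (an assumption).
    """
    return sum(coef * sample.get(gene, 0) for gene, coef in COEFFICIENTS.items())


def split_by_median(scores):
    """Assign 'high' to scores above the median and 'low' otherwise.

    Placing ties in the low-risk group is an assumption of this sketch.
    """
    cutoff = median(scores)
    return ["high" if s > cutoff else "low" for s in scores]


# Hypothetical samples with an assumed 0/1 encoding of each indicator.
samples = [
    {},                          # no indicator present
    {"DDX4": 1},
    {"DOCK7": 1},
    {"S1PR1": 1, "CDKN2A": 1},
]
scores = [risk_score(s) for s in samples]
groups = split_by_median(scores)
print(list(zip(scores, groups)))
```

Only DOCK7 carries a negative coefficient, so in this sketch a sample whose sole indicator is DOCK7 receives the lowest score.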
Statistics
This is a retrospective study, and its reporting conforms to the TRIPOD guidelines.22 All statistical analyses were performed in R (version 4.0.0). Kaplan–Meier analysis was used to assess differences in survival, and the log-rank test was used to test their statistical significance. Cox proportional hazards regression was used to analyze the factors affecting the survival of patients with lung cancer. P < 0.05 was considered statistically significant.
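For readers unfamiliar with the Kaplan–Meier estimator, a minimal Python sketch of the product-limit calculation is given below. It is an illustration under simplified assumptions, not the R code used in this study; the follow-up times and event indicators are hypothetical, and the log-rank test is not included.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time of each patient
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct
    time where at least one death occurred.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= removed
    return curve


# Hypothetical follow-up data for 5 patients.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print(curve)
```

Censored patients contribute to the risk set up to their last follow-up but never lower the curve themselves, which is why the patient censored at the final time point adds no step.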
Results
Characterization of Somatic Mutations in All Samples
First, we analyzed the somatic mutations in the 567 samples, whose clinical characteristics are shown in Table 1. As shown in Figure 1, missense mutations accounted for the highest proportion of all variant classifications, and single nucleotide polymorphisms (SNPs) accounted for the highest proportion of variant types. Among single nucleotide variations (SNVs), C to A substitutions were the most frequent. The average number of mutations per sample was 166. Ranked by the number of variants, the top 10 genes were titin (TTN), mucin 16 (MUC16), ryanodine receptor 2 (RYR2), complement C1r/C1s, Uegf, Bmp1 (CUB) and Sushi multiple domains 3 (CSMD3), LDL (low-density lipoprotein) receptor-related protein 1B (LRP1B), tumor protein p53 (TP53), usherin (USH2A), zinc finger homeobox 4 (ZFHX4), xin actin binding repeat containing 2 (XIRP2), and KRAS proto-oncogene (KRAS). We then ranked the genes by the number of mutated samples and show the top 30 (Figure 2).

Figure 1. Characterization of somatic mutations in all samples. Abbreviations: CSMD3, CUB and Sushi multiple domains 3; DEL, deletion; Ins, insertion; LRP1B, LDL (low-density lipoprotein) receptor-related protein 1B; MUC16, mucin 16; RYR2, ryanodine receptor 2; SNV, single nucleotide variation; SNP, single nucleotide polymorphism; TP53, tumor protein p53; TTN, titin; USH2A, usherin; XIRP2, xin actin binding repeat containing 2; ZFHX4, zinc finger homeobox 4.

Figure 2. The top 30 genes ranked by the number of mutated samples. Abbreviations: CSMD3, CUB and Sushi multiple domains 3; LRP1B, LDL (low-density lipoprotein) receptor-related protein 1B; MUC16, mucin 16; RYR2, ryanodine receptor 2; TP53, tumor protein p53; TTN, titin; USH2A, usherin; XIRP2, xin actin binding repeat containing 2; ZFHX4, zinc finger homeobox 4; FLG, filaggrin; SPTA1, spectrin-alpha erythrocytic 1; FAT3, FAT atypical cadherin 3; NAV3, neuron navigator 3; COL11A1, collagen type XI alpha 1 chain; ZNF536, zinc finger protein 536; CSMD1, CUB and Sushi multiple domains 1; ANK2, ankyrin 2; PCLO, piccolo presynaptic cytomatrix protein; PCDH15, protocadherin related 15; ADAMTS12, ADAM metallopeptidase with thrombospondin type 1 motif 12; KEAP1, kelch like ECH associated protein 1; TNR, tenascin R; PAPPA2, pappalysin 2; DNAH9, dynein axonemal heavy chain 9; ADGRG4, adhesion G protein-coupled receptor G4; RP1L1, RP1 like 1.
Table 1. Clinical Characteristics of Patients Involved in the Study.
Note: Lifelong nonsmoker (<100 cigarettes smoked in lifetime) = 1.
Current smoker (includes daily smokers and nondaily or occasional smokers) = 2.
Current reformed smoker for >15 years = 3.
Current reformed smoker for ≤15 years = 4.
Current reformed smoker, duration not specified = 5.
Abbreviations: Mx, metastasis cannot be measured; Nx, cancer in nearby lymph nodes cannot be measured; Tx, main tumor cannot be measured.
Screening of Somatic Mutation Indicators and Transcriptome Expression Indicators
Taking the survival status (alive or dead) from the follow-up information in the clinical data as the response, we used the random forest algorithm to screen somatic mutation indicators according to the Gini index. The top 30 predictors are shown in Figure 3. We entered the top 5 genes into the risk prediction model as somatic mutation indicators: sphingosine 1-phosphate receptor 1 (S1PR1), dedicator of cytokinesis 7 (DOCK7), DEAD-box helicase 4 (DDX4), laminin subunit beta 3 (LAMB3), and importin 5 (IPO5).

Figure 3. Somatic mutation indicators screened by the random forest algorithm according to the Gini index. Abbreviations: CSMD3, CUB and Sushi multiple domains 3; LRP1B, LDL (low-density lipoprotein) receptor-related protein 1B; MUC16, mucin 16; RYR2, ryanodine receptor 2; TP53, tumor protein p53; TTN, titin; USH2A, usherin; XIRP2, xin actin binding repeat containing 2; ZFHX4, zinc finger homeobox 4.
CDKN2A has been widely reported as a clinical prognostic factor for lung cancer.19,23,24 We therefore downloaded the TCGA–LUAD transcriptome expression data, extracted the 57 paired samples, and compared CDKN2A expression between the tumor samples and the controls (Figure 4). CDKN2A was differentially expressed in these 57 pairs, so we included it as the transcriptome expression indicator in the risk prediction model.

Figure 4. CDKN2A gene expression in the tumor samples (n = 57) and the controls (n = 57). Abbreviations: CDKN2A, cyclin-dependent kinase inhibitor 2A; TCGA, The Cancer Genome Atlas.
Construction of a Prognostic Nomogram
We developed a prognostic nomogram by combining the somatic mutation indicators, the transcriptome expression indicator, gender, and smoking (Figure 5). The Kaplan–Meier survival curves showed a significant difference in overall survival between the high-risk and low-risk groups (Figure 6A, P = 0.0323), indicating that the model has some capacity to predict the prognosis of patients with lung cancer. ROC curves were used to assess the risk model in predicting 3-year (AUC = 0.609), 5-year (AUC = 0.673), and 10-year (AUC = 0.698) survival (Figure 6B). Each AUC exceeded 0.6, indicating that the model has potential for estimating the prognosis of patients with lung cancer, although its discrimination is moderate. Furthermore, the accuracy, sensitivity, and specificity of the model were 60.48%, 56.83%, and 63.47% for 3-year survival; 69.85%, 55.35%, and 74.62% for 5-year survival; and 76.19%, 56.71%, and 86.23% for 10-year survival.
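The performance measures reported above can be illustrated with a short Python sketch. It treats survival status at a fixed horizon as a binary label and ignores censoring, which is a simplification of the time-dependent ROC analysis appropriate for survival data; the labels, scores, and the 0.5 threshold are hypothetical.

```python
def confusion_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels (1 = event)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical status at a fixed horizon (1 = died) and model risk scores.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s > 0.5 else 0 for s in scores]  # assumed threshold of 0.5

acc, sens, spec = confusion_metrics(y_true, y_pred)
area = auc(y_true, scores)
print(acc, sens, spec, area)
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas the AUC summarizes discrimination across all thresholds, which is why the two sets of figures can move differently across the 3-year, 5-year, and 10-year horizons.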

Figure 5. Nomogram for predicting 3-year, 5-year, and 10-year survival for patients with lung cancer in the TCGA data set based on somatic mutation indicators, CDKN2A, and clinicopathological parameters (smoking and gender). Abbreviations: CDKN2A, cyclin-dependent kinase inhibitor 2A; DDX4, DEAD-box helicase 4; DOCK7, dedicator of cytokinesis 7; IPO5, importin 5; LAMB3, laminin subunit beta 3; S1PR1, sphingosine 1-phosphate receptor 1; TCGA, The Cancer Genome Atlas.

Figure 6. Kaplan–Meier survival analysis of the low-risk group and high-risk group (A). ROC curves were used to confirm the discriminative ability of the nomogram (B). Abbreviations: AUC, area under the curve; ROC, receiver operating characteristic.
Discussion
In this study, 5 somatic mutation indicators (S1PR1, DOCK7, DDX4, LAMB3, and IPO5) were included in the risk model. S1PR1 is the receptor of the sphingolipid product S1P. It is highly expressed in breast cancer, gastric cancer, and hepatocellular carcinoma and indicates a poor prognosis, which is related to the regulation of drug resistance and metastasis of cancer cells by S1P/S1PR1 signaling.25–27 DOCK7 is a replication stress regulator that mediates the replication stress response by activating Rac to promote the stability of replication protein A. DOCK7 is highly expressed in ovarian cancer and glioblastoma and is negatively correlated with the overall survival of patients.28,29 DDX4 is an IgG autoantigen that is located in the mitotic apparatus, is widely expressed in somatic cell-derived cancer cell lines, and mainly promotes tumor metastasis.30 The LM-332 protein encoded by LAMB3, an extracellular matrix protein, participates in important biological processes such as cell differentiation, adhesion, and survival, and is related to the metastatic ability of many types of cancer (colorectal, pancreatic, thyroid, and lung cancers).31–33 IPO5 is a nuclear transporter that can lead to abnormal localization of oncogenes and tumor suppressor genes, thus causing drug resistance and abnormal proliferation of cancer cells.34 Pauline J van der Watt et al35 incorporated IPO5 into the diagnostic markers of cervical and esophageal cancers and obtained high sensitivity and specificity. In summary, all 5 indicators are related to drug resistance, metastasis, or proliferation of cancer, and some have been considered as markers for the early diagnosis or prognosis of cancer. However, the predictive efficiency of a single gene is not as good as that of multigene and multi-index models. Jianbo Pan et al36 included DDX4 in a biomarker panel for the early diagnosis of lung cancer, with a sensitivity of 73.5% and a specificity of more than 85%. Similarly, DDX4 was used as one feature gene in our model. However, because the model of Jianbo Pan et al is a diagnostic model whereas ours is a prognostic model, the two cannot be compared directly. Han-Jun Cho et al37 identified 6 mutant genes related to the prognosis of LUAD through machine learning, but did not report the ROC curve or AUC. Their study also pointed out the relationship between gene mutation and the prognosis of LUAD, which indicates the possibility of applying somatic mutation genes to guide the clinical prognosis of LUAD.
Lung cancer results from the individual or combined action of a variety of risk factors. Considering multiple risk factors together can screen out the population at high risk of lung cancer more effectively. In this study, we included smoking and gender, 2 major risk factors of lung cancer that are often considered clinically. Lung cancer risk prediction models that incorporate both gender and smoking indicators have been developed with convincing discrimination.38,39 In addition, the combined use of multi-angle, multifactor tumor molecular markers for the detection of lung cancer is more accurate than a single test. We therefore also included CDKN2A as an indicator. CDKN2A was silenced in more than 70% of lung squamous cell carcinoma samples.40 Chunkang et al21 constructed a lung cancer diagnosis model with 6 genes including CDKN2A (p16), which can effectively diagnose early lung cancer and indicate cancer risk. In addition, Wei Liu et al19 reported that CDKN2A indicates a poor prognosis of lung cancer, which is consistent with our findings. In this study, 5 somatic mutation indicators and CDKN2A were combined with smoking and gender to construct a lung cancer early warning model, which may help clinical early warning of lung cancer by predicting the survival of patients.
However, this study has some limitations. Owing to objective constraints, it included only 2 groups of samples, lung cancer and healthy controls, for the construction of the early warning model. Given the complexity of clinical tumor diagnosis, benign lung diseases and other tumor cases need to be collected in the future to improve the specificity and accuracy of discrimination. In addition, this study performed only internal validation, and external validation through large-sample, multicenter, prospective studies is required.
Conclusion
The novelty of this study lies in establishing an early warning model for lung cancer by machine learning based on the clinical characteristics of smoking and gender, somatic mutation genes, and the CDKN2A gene. The predictive performance of this early warning model may be suitable for clinical practice and may provide targeted guidance for the early prediction of outcomes in patients with lung cancer.
Footnotes
Availability of Data and Materials
The analyzed data sets generated during the study are available from the corresponding author on reasonable request.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Zhejiang Medical and Health Research Project (grant number: 2021KY1235), Lishui Science and Technology Bureau Project (grant numbers: 2020077571 and 2022ZDYF11).
