Abstract
Importance:
Nomogram prognostic models can facilitate cancer patient treatment plans and patient enrollment in clinical trials.
Objective:
The primary objective is to provide an updated and accurate prognostic model for predicting the survival of advanced non-small-cell lung cancer (NSCLC) patients, and the secondary objective is to validate a published nomogram prognostic model for NSCLC using an independent patient cohort.
Design:
A total of 1817 patients with advanced NSCLC from the control arms of 4 Phase III randomized clinical trials were included in this study. Data from the 524 NSCLC patients in one of these trials were used first to validate a previously published nomogram and then to develop an updated nomogram. Patients from the other 3 trials served as independent validation cohorts for the new nomogram. Prognostic performance was evaluated using hazard ratios, integrated area under the curve (AUC), concordance index, and calibration plots.
Setting:
General community.
Main outcome:
A nomogram model was developed to predict overall survival in NSCLC patients.
Results:
We demonstrated the prognostic power of the previously published model in an independent cohort. The updated prognostic model contains the following variables: sex, histology, performance status, liver metastasis, hemoglobin level, white blood cell count, peritoneal metastasis, skin metastasis, and lymphocyte percentage. This model was validated using several evaluation criteria on the 3 independent cohorts with heterogeneous NSCLC populations. In the SUN1087 patient cohort, the continuous risk score output by the nomogram achieved an integrated area under the receiver operating characteristic (ROC) curve of 0.83, a log-rank
Conclusions:
This nomogram model based on basic clinical features and routine lab testing predicts individual survival probabilities for advanced NSCLC and exhibits cross-study robustness.
Introduction
Lung cancer is the leading cause of cancer-related death. 1 Prognostic models that integrate multiple clinical attributes offer greater precision in predicting outcomes and can also aid in defining patient enrollment criteria for clinical trials. Several prognostic models have been developed for advanced lung cancer, but these models did not undergo external validation against multiple independent data sets.2-5 Given the variety of regimens considered appropriate treatment in advanced NSCLC, 6 this calls into question the generalizability of these models. In addition, user-friendly online implementations of diagnostic/prognostic models have greatly enhanced patient care for breast cancer, 7 but no such tools are available for lung cancer prognosis.
Recently, public accessibility of clinical trial data has led to a paradigm shift in clinical research. The Food and Drug Administration (FDA) Amendments Act passed in 2007 resulted in the registration and reporting of most clinical trials in the United States on ClinicalTrials.gov. The Trial and Experimental Studies Transparency Act passed in 2012 required that the results of interventional trials be reported to a publicly available online database. 8 Shared databases are becoming a highly valuable resource for the construction, validation, and subsequent recalibration of prognostic models. One such data sharing initiative is Project Data Sphere, LLC, an independent, not-for-profit initiative of the CEO Roundtable on Cancer’s Life Sciences Consortium that broadly shares de-identified comparator arm data from late-phase oncology clinical trials with researchers.9,10 How to best use such valuable shared clinical trial data to improve medical research and patient care is still in the exploration stage.
The first objective of this study was to use publicly available clinical trial data to externally validate the most commonly used previously published nomogram prognostic model for patients with advanced NSCLC. 11 The second objective was to develop and validate a new model for advanced NSCLC, adhering to recommendations regarding transparency in methods and performing appropriate external validation. 4 External validation was performed on multiple independent data sets to demonstrate the generalizability of the model. The third objective was to develop a public online implementation of the new nomogram and a number of previously published prognostic models.11-14
Patients and Methods
Patients
De-identified NSCLC patient data from Project Data Sphere were used in model construction and validation. An application for data access was submitted to Project Data Sphere through https://projectdatasphere.org/projectdatasphere/html/registration and was approved. We consented to and complied with the Data User Agreement. The data set included 1817 patients with advanced NSCLC from the comparator arms of 4 Phase III randomized clinical trials. The 4 trials were as follows: CA031 (n = 524), a trial comparing nab-paclitaxel plus carboplatin to solvent-based paclitaxel plus carboplatin as first-line therapy for patients with advanced NSCLC; 15 SUN1087 (n = 480), a trial comparing sunitinib plus erlotinib to erlotinib alone in patients with advanced NSCLC refractory to 1 or 2 chemotherapy regimens; 16 SAVEONCO (n = 358), a trial assessing the efficacy of semuloparin sodium for prevention of venous thromboembolism in patients with a variety of advanced solid tumors; 17 and VITAL (n = 455), a trial comparing (ziv-)aflibercept plus docetaxel to docetaxel alone for advanced NSCLC refractory to treatment with platinum-based chemotherapy. 18 The characteristics of the comparator arms of the 4 trials are shown in Table 1, with greater detail on inclusion and exclusion criteria in Supplementary Table 1. The 4 trials include patients receiving first- and/or second-line chemotherapy, with regimens varying between studies. Because this study focused on patient prognosis, only patients from the comparator arms of these trials were included. SAVEONCO includes patients with a variety of advanced stage cancers, and only its patients with lung cancer were used for the analysis. For the other 3 studies, all patients from the comparator arms were used. Figure 1A shows how each of these data sets was used for model development and validation in this study.
Table 1. Summary of the comparator arms of the clinical trial data used in this study. Inclusion criteria are shown only in brief. Greater detail and information on exclusion criteria are shown in Supplementary Table 1.
Abbreviations: NSCLC, non-small-cell lung cancer; VTE, venous thromboembolism.

Figure 1. Flowchart showing the use of the models and data sets in this study. (A) CA031 was used as a training set for the new nomogram. SUN1087, SAVEONCO, and VITAL were used as external validation sets for the new nomogram. All models involved in this study are implemented on a web portal. (B) Exploratory survival analysis for the 4 data sets used in this study.
Nomogram development
The new nomogram was developed using CA031 15 as a training set and then validated in 3 independent validation sets: SUN1087, 16 SAVEONCO, 17 and VITAL. 18 Clinical variables that were present in both CA031 15 and at least 1 validation data set were selected as potential co-variables in the model. Overall survival was used as the primary outcome for the nomogram model. Univariate analyses were first used to establish the associations between potential predictors and overall survival in the training set. Co-variables with statistically significant associations with survival (
Survival analysis
Overall survival time was calculated from the date of randomization until death or the date of last follow-up. Survival curves were estimated using the Kaplan-Meier product-limit method. 21 Differences between survival curves were assessed using a log-rank test. A univariate Cox proportional-hazards model 22 was used to assess the association between each continuous variable and overall survival.
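As a rough illustration of the two methods named above, the following minimal sketch computes a Kaplan-Meier product-limit curve and a two-group log-rank chi-square statistic in plain Python. It is not the authors' code; the function names `kaplan_meier` and `logrank_chi2` and the 0/1 encodings for events and groups are assumptions made for this example.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate.

    times  : follow-up time of each patient
    events : 1 if death was observed, 0 if censored (assumed encoding)
    Returns a list of (time, survival probability) at each distinct death time.
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)  # patients leaving the risk set at each time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= exits[t]
    return curve


def logrank_chi2(times, events, groups):
    """Two-group log-rank chi-square statistic (1 degree of freedom).

    groups : 0 or 1 for each patient (hypothetical low-/high-risk labels)
    """
    data = list(zip(times, events, groups))
    obs_minus_exp = 0.0
    variance = 0.0
    for t in sorted({ti for ti, e, _ in data if e == 1}):
        n = sum(1 for ti, _, _ in data if ti >= t)               # at risk, both groups
        n1 = sum(1 for ti, _, g in data if ti >= t and g == 1)   # at risk, group 1
        d = sum(1 for ti, e, _ in data if ti == t and e == 1)    # deaths, both groups
        d1 = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 1)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0
```

The statistic returned by `logrank_chi2` would be compared against a chi-square distribution with 1 degree of freedom to obtain a P value; that step is omitted here to keep the sketch free of external dependencies.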
Validation criteria
For each study, patients whose histology did not demonstrate adenocarcinoma (AD), squamous cell carcinoma (SCC), or large cell carcinoma (LC) of the lung, such as those with uncertain histology, were excluded. For the purposes of validation, missing data were imputed as the population median. Four criteria were used to evaluate the prediction performance of Hoang et al’s 11 model and our new nomogram:
The patients in the testing data set were split into 20 roughly equal groups (bins) by their predicted survival probabilities. Within each bin, the number of patients whose observed status at the specified time point (alive or dead) matched the event class (alive) was counted, and the observed event rate was calculated.
The resulting calibration plot is essentially a scatter plot of the observed event rate against the midpoint predicted probability of each bin. The confidence intervals on the estimated proportions were constructed using the binomial test.
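The binning procedure described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, each patient's alive/dead status at the chosen time point is assumed to be known (handling of patients censored before that time is not specified in the text), and the exact Clopper-Pearson interval is assumed as the binomial confidence interval.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_p(k, n, target):
    """Bisection for p with binom_cdf(k, n, p) == target (the CDF decreases in p)."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact binomial confidence interval for k successes out of n (assumed CI type)."""
    lower = 0.0 if k == 0 else _solve_p(k - 1, n, 1 - alpha / 2)
    upper = 1.0 if k == n else _solve_p(k, n, alpha / 2)
    return lower, upper


def calibration_bins(pred_surv, alive, n_bins=20):
    """Split patients into n_bins roughly equal groups by predicted survival.

    pred_surv : predicted probability of being alive at the time point
    alive     : 1 if the patient was observed alive at that time, else 0
    Returns one dict per bin with the bin midpoint, observed rate, CI, and size.
    """
    order = sorted(range(len(pred_surv)), key=lambda i: pred_surv[i])
    n = len(order)
    rows = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue  # fewer patients than bins
        probs = [pred_surv[i] for i in idx]  # already sorted ascending
        k = sum(alive[i] for i in idx)
        rows.append({
            "midpoint": (probs[0] + probs[-1]) / 2,
            "observed": k / len(idx),
            "ci": clopper_pearson(k, len(idx)),
            "n": len(idx),
        })
    return rows
```

Plotting `observed` against `midpoint` with the `ci` bounds as error bars gives the kind of scatter plot the text describes; points close to the diagonal indicate good calibration.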
Implementation of previously published models
To facilitate clinician, researcher, and patient utilization of the prognostic models published previously and of our new model, we created a user-friendly Web server for our model together with the 4 published prognostic models for lung cancer11-14 shown in Supplementary Table 2 (http://lce.biohpc.swmed.edu/lungcancer/nomogram). Details of the implementations of the 4 published models are described in Supplementary Table 2.
Results
Clinical trial and patient population characteristics
Thirty-one patients from CA031 15 were excluded for having a cancer type other than AD, SCC, or LC. Three patients were excluded from this study because of a lack of follow-up information or because >50% of covariates were missing. One patient was excluded from the VITAL cohort because survival information was missing. Kaplan-Meier plots for follow-up time and follow-up status for all 4 studies are shown in Figure 1B. The 1-year survival rates for these 4 studies range from 0.47 to 0.709. A summary of the data distribution for the 21 variables selected for evaluation is shown in Supplementary Table 3.
Validation of a previously published nomogram
As a first step, we validated the nomogram published by Hoang et al 11 in 2005 using data from CA031, 15 which contains all variables used in the nomogram. Hoang et al’s model is the most cited and used prognostic model for advanced NSCLC, but no independent validation has been performed since its publication in 2005 because of a lack of validation cohorts. Supplementary Figure 1 shows that patients in the CA031 cohort in the high-risk group predicted by the Hoang model have significantly worse survival outcomes compared with those in the low-risk group (
Developing a new prognostic nomogram
Results of univariate analysis of the association between each eligible variable and patient overall survival within CA031 15 are displayed in Table 2. In total, 12 variables had a
Univariate analysis of prognostic survival for the CA031 study. The columns are variable name, univariate likelihood ratio test
Abbreviations: LC, large cell carcinoma; SCC, squamous cell carcinoma; AD, adenocarcinoma.
Hazard ratios (HR) and 95% confidence intervals of nomogram parameters. Variables marked by an asterisk are logarithm transformed.
Abbreviations: LC, large cell carcinoma; SCC, squamous cell carcinoma; AD, adenocarcinoma; ECOG, Eastern Cooperative Oncology Group.
Validation of the new prognostic nomogram
The proposed nomogram was developed using the CA031 15 patient cohort as a training set and validated in 3 independent patient cohorts. The validation results are presented in Figure 2 and Supplementary Figure 2(b) to (d). In Figure 2, the risk groups are defined using the median value of the predicted 2-year survival probabilities. The Hoang et al nomogram was also validated on these 3 cohorts for comparison. In the SUN1087 16 patient cohort, the continuous risk score output by the new nomogram achieved an integrated area under the ROC curve of 0.83 from the 6th month to the 18th month and a concordance index of 0.717 (Supplementary Figure 2(b)). The log-rank

Evaluation of nomograms by log-rank test. Evaluation of previous nomogram and new nomogram on testing data sets including (A, B) the SUN1087 study, (C, D) the SAVEONCO study, and (E, F) the VITAL study. Each panel shows the separation of the Kaplan-Meier estimator by the dichotomized risk score for the testing patients, and
We also evaluated the performance of our new nomogram and the Hoang et al nomogram using calibration plots (Figure 3). Figures 2 and 3 both indicate that our new nomogram outperforms the Hoang model, at least on these 3 independent data sets.

Calibration plots. Calibration plots of previous nomogram and new nomogram were generated using 3 testing data sets: (A, B) the SUN1087 study, (C, D) the SAVEONCO study, and (E, F) the VITAL study.
Building a user-friendly Web server for the updated and previous nomograms
We have also provided an online version of this nomogram (Supplementary Figure 3) to facilitate its widespread use by physicians and researchers (http://lce.biohpc.swmed.edu/lungcancer/nomogram/index.php). Online implementations of several previously developed models are also available11-14 (Supplementary Figure 4). Comparison of overall survival probabilities can be made between these nomograms by inputting patients’ clinical features and reading output generated by the Web server.
Discussion
The landscape of lung cancer patients and treatment has shifted over time. It is therefore valuable to update nomograms as new data become available. We developed and validated a prognostic model for patients with advanced stage NSCLC treated with chemotherapy using up-to-date patient data collected after 2007. The nomogram was built using data from CA031 15 and validated with 3 independent clinical trials. The new nomogram meets the guidelines for AJCC endorsement. 25 A wide range of clinical features have been incorporated into prognostic models in the past, and a detailed comparison with other published prognostic models is presented in Supplementary Table 2. Robust external validation demonstrated the discriminatory power of the nomogram through 3 different statistical measures for each validation data set. First, dichotomization of patients into high- and low-risk subgroups based on their calculated nomogram score showed a statistically significant difference in the survival curves between groups. Second, the nomogram had a high c-index in each validation set, which means that for any given pair of patients, there is a high probability that the nomogram score correctly predicts which patient will have better survival. Third, the nomogram performed well by area under the ROC curve at every measured time point, indicating that the nomogram had a strong ability to predict whether or not a patient would be alive at a given time after randomization. In addition, our analyses demonstrated the good calibration performance of our new nomogram.
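The pairwise interpretation of the c-index given above can be made concrete with a short sketch. It uses Harrell's definition of the concordance index as an assumption (the text does not state which estimator was used), and the function name and input encodings are hypothetical.

```python
def concordance_index(times, events, risk):
    """Harrell's concordance index for right-censored survival data.

    times  : follow-up time of each patient
    events : 1 if death was observed, 0 if censored (assumed encoding)
    risk   : model risk score, where a higher score means worse predicted survival
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable only when patient i is known to have died
            # before patient j's last observed time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # shorter survivor has the higher risk score
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied scores count as half
    return concordant / usable if usable else float("nan")
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so a c-index such as the 0.717 reported for SUN1087 means that, among usable pairs, the patient with the higher risk score was the one who died first in roughly 72% of cases.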
Each of the validation data sets contained patients receiving different types of treatment, including first-, second-, and third-line treatments and both cytotoxic and targeted regimens. Survival analysis thus unsurprisingly demonstrated considerable heterogeneity in survival characteristics between the studies used to validate the nomogram. In spite of this, the proposed nomogram demonstrated accuracy and robust performance across multiple testing data sets. A major strength of this study, therefore, is that we have demonstrated the external validity of our model in a broad range of clinical scenarios. The 3 testing data sets include patients on both targeted and non-targeted therapies. This is in contrast to many previously published prognostic tools for advanced stage NSCLC, which did not undergo external validation and therefore have not been proven to be generalizable to the disparate situations that clinicians are likely to encounter. 2
Moreover, in the modern treatment era, conventional cytotoxic chemotherapy remains a component of treatment for almost all patients with lung cancer, although immunotherapy and molecular-targeted therapy are increasingly common. Only a small minority of patients (about 20%) will have a druggable kinase alteration such as epidermal growth factor receptor (EGFR) or anaplastic lymphoma kinase (ALK). For example, the FDA recently limited the indication of the EGFR inhibitor erlotinib to cases with specific EGFR mutations. Although immunotherapy has become a treatment option, only about 25% of patients have high-level (⩾ 50%) PD-L1 expression, for which first-line immunotherapy is better than chemotherapy. Therefore, the variables included in our proposed model should be applicable to the general patient population. In the future, however, it will be interesting to consider the key biomarkers in NSCLC, including EGFR genotype, ALK genotype, and PD-L1 status, together with clinical covariates for patient prognosis when data sets with large sample sizes become available.
One challenge in using public data arose from missing data across cohorts. Performance status (ECOG or Karnofsky) is the only variable included in all 5 prognostic models, which underlines its importance for prognosis. The other variables differ considerably among the 5 models, which calls for a direct comparison of these models. Such a comparison is not possible within the scope of this study, however, because some clinical variables that are frequently used in prognostic modeling of NSCLC, such as serum lactate dehydrogenase, were not available in the public clinical trial data.
By implementing prognostic models on a Web server, we provide an easy way for researchers and clinicians to access the predicted overall survival probabilities and also to make straightforward comparisons between survival probabilities generated by different models. Our hope is that greater transparency in data reporting for clinical trials will be accompanied by greater access for clinicians to prognostic models that can enhance precise tailoring of patient management.
Supplemental Material
Supplemental material, Suppl_material for Development and Validation of a Nomogram Prognostic Model for Patients With Advanced Non-Small-Cell Lung Cancer by Tao Wang, Rong Lu, Sunny Lai, Joan H Schiller, Fang Liz Zhou, Bo Ci, Stacy Wang, Xiaohan Gao, Bo Yao, David E Gerber, David H Johnson, Guanghua Xiao and Yang Xie in Cancer Informatics
Footnotes
Acknowledgements
The authors acknowledge Tsung-Wei Ma for help with downloading the data and converting the format and Jessie Norris for proofreading the manuscript.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Institutes of Health (1R01GM115473, 5R01CA152301, P50CA70907, 5P30CA142543, and 1R01CA172211), the National Cancer Institute (NCI) Midcareer Award in Patient-Oriented Research (K24CA201543-01; to D.E.G.), and the Cancer Prevention and Research Institute of Texas (RP120732 and RP180805).
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
TW and RL contributed equally.
Supplemental Material
Supplemental material for this article is available online.
