Integrating Machine Learning for Early Mortality Prediction in Lung Adenosquamous Carcinoma: A Web-Based Prognostic Model

Abstract

Introduction

Lung adenosquamous carcinoma (ASC) is an uncommon histological subtype of lung cancer that combines the characteristics of adenocarcinoma and squamous cell carcinoma and shows more aggressive biological behavior. This study aimed to quantify the 90-day mortality rate in patients with ASC, identify associated features, and develop a predictive machine learning model.

Methods

This retrospective study obtained data from the Surveillance, Epidemiology, and End Results (SEER) program database, covering the period from 2000 to 2018. Through univariate logistic regression and Lasso analyses, significant prognostic features were determined. We developed predictive models using XGBoost, logistic regression, and AJCC staging algorithms, assessing their performance via metrics such as the Area Under the Receiver Operating Characteristic Curve (AUC), Decision Curve Analysis (DCA), Kolmogorov-Smirnov (KS) statistic, and calibration plots. Restricted Cubic Splines (RCS) were employed to assess potential non-linear relationships between continuous features and survival outcomes.

Results

Our analysis of 2820 eligible patients identified 6 clinical features significantly affecting outcomes. The XGBoost model exhibited exceptional discriminatory power, with AUC scores of 0.97 in the training set and 0.84 in the validation set, surpassing other models in all datasets according to AUC, KS score, DCA, and calibration analyses. RCS analysis showed a non-linear association between tumor size and prognosis, with a cutoff size of 44 mm. Moreover, we integrated the model into a web-based platform to enhance its accessibility.

Conclusions

We present a novel machine learning model, supported by an easily accessible web-based platform, to guide personalized clinical decision-making and optimize treatment strategies for patients with ASC.

Keywords

lung adenosquamous carcinoma; early mortality; XGBoost; machine learning; prognostic biomarker

Introduction

Lung cancer remains one of the deadliest and most common malignancies globally, characterized by its high prevalence and mortality rates.¹ Among its types, Non-Small Cell Lung Cancer (NSCLC) and Small Cell Lung Cancer (SCLC) are the primary histopathological categories, with NSCLC further segmented into adenocarcinoma (AC), squamous cell carcinoma (SCC), and adenosquamous carcinoma (ASC). According to the 2021 WHO pathological classification, ASC is recognized as a unique subtype, marked by the presence of both AC and SCC components, with each constituting at least 10% of the tumor.² ASC distinguishes itself not only by its mixed features but also by its notable resistance to adjuvant chemotherapy and a higher tendency for local recurrence or distant metastasis compared to other NSCLC histologies.³,⁴

Early mortality is a critical marker of less-than-ideal outcomes in lung cancer treatment, often indicating potentially unsuitable therapy choices. The aggressive nature of ASC suggests that early mortality is more common in ASC patients than in those with AC or SCC. Yet, there is a significant gap in data regarding the early mortality rates among ASC patients, and the lack of a prognostic model for this group presents a considerable challenge in clinical practice, hindering the improvement of treatment strategies and patient care.

The field of oncology has recently benefited from significant advancements in modeling, especially with the incorporation of machine learning techniques. These computational approaches deliver remarkable precision in predicting cancer progression and treatment outcomes, including responses to targeted therapies for gene mutations and immunotherapy.⁵ Moreover, machine learning algorithms excel at analyzing large and complex datasets, uncovering patterns that are often imperceptible to human observers. This capability paves the way for more personalized and effective treatment strategies.⁶

In this study, we utilize the Surveillance, Epidemiology, and End Results (SEER) database to craft prognostic models based on machine learning techniques, aiming to predict early mortality among patients with ASC. A thorough comparative analysis assesses these models against conventional logistic regression and AJCC staging systems. Our research culminates in the creation of an easily accessible web-based classifier that delivers insightful visualizations, establishing itself as an invaluable aid in refining clinical decision-making.

Methods

This retrospective study was reported in accordance with the TRIPOD guidelines.⁷

Raw Data Source

The SEER program, operated by the National Cancer Institute (NCI), stands as a comprehensive source of cancer statistics in the United States. This program gathers and shares data regarding cancer incidence and patient survival, sourced from registries covering roughly 34.6% of the U.S. population. Our study accessed these invaluable datasets in compliance with the SEER Research Data Agreement. We extracted clinicopathological information using SEER*Stat software version 8.4.0.1, which is available at https://seer.cancer.gov/data-software/.

Inclusion Criteria

This study included individuals diagnosed with ASC from 2000 to 2018. Cases were identified according to the International Classification of Diseases for Oncology, Third Edition (ICD-O-3), using site codes C34.0-C34.9 and histology code 8560. To be included, patients had to meet two criteria: a pathologically confirmed ASC diagnosis and a single primary tumor, with no other malignancies recorded in the database.

Exclusion Criteria

Patients were excluded from the study for several reasons. Those missing detailed demographic or comprehensive clinicopathological data, such as information about the primary tumor’s site, size, laterality, histologic grade, and American Joint Committee on Cancer (AJCC) stage, were omitted. We also excluded individuals if their records did not fully document surgery, chemotherapy or radiotherapy treatments or if their survival status and follow-up data were absent.
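As an illustration only, the inclusion and exclusion rules above can be sketched in Python. The study itself extracted data with SEER*Stat and analyzed it in R; every field name below (site_code, histology_code, n_primaries, and so on) is a hypothetical label rather than a real SEER column name, and "missing" is assumed to mean an absent or blank value.

```python
import re

# Hypothetical field names; a real SEER*Stat export labels its columns differently.
REQUIRED_FIELDS = (
    "age", "sex", "race", "marital_status", "tumor_size_mm", "laterality",
    "grade", "ajcc_stage", "surgery", "chemotherapy", "radiotherapy",
    "survival_months", "vital_status",
)

SITE_PATTERN = re.compile(r"^C34\.[0-9]$")  # ICD-O-3 site codes C34.0-C34.9
ASC_HISTOLOGY = "8560"                      # ICD-O-3 histology code for ASC


def is_eligible(record):
    """Return True if a record passes the inclusion and exclusion rules."""
    if not SITE_PATTERN.match(str(record.get("site_code") or "")):
        return False
    if record.get("histology_code") != ASC_HISTOLOGY:
        return False
    if record.get("pathologically_confirmed") is not True:
        return False
    if record.get("n_primaries") != 1:  # single primary tumor only
        return False
    # Exclude records in which any required field is absent or blank.
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


example = {field: "recorded" for field in REQUIRED_FIELDS}
example.update(site_code="C34.1", histology_code="8560",
               pathologically_confirmed=True, n_primaries=1)
print(is_eligible(example))                            # eligible record
print(is_eligible({**example, "tumor_size_mm": None})) # missing tumor size
```

Keeping the rules in one predicate makes it easy to report how many records each criterion removes when building a screening flowchart.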

Study Endpoint

The primary endpoint of the study was early mortality, defined as death occurring within 90 days of diagnosis. Patients who survived beyond 90 days served as the comparison cohort, enabling us to analyze the factors influencing early mortality in patients with ASC.

Baseline Characteristics Presentation

We first summarized the clinical and demographic characteristics of the study population, reporting continuous variables as mean and standard deviation and categorical variables as frequencies and percentages. The analysis covered 17 features and sought independent prognostic features among patients with ASC. These comprised demographic features (age, sex, race, and marital status), clinicopathological tumor attributes (site, size, laterality, grade, AJCC stage, and distant metastases), and treatment-related information (surgery, chemotherapy, and radiotherapy). Considering demographic, clinicopathological, and treatment factors together gives a fuller picture of the factors influencing prognosis in ASC.
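The summary conventions described above (mean and standard deviation for continuous variables, frequency and percentage for categorical ones) can be sketched as follows. This is a minimal Python illustration, not the authors' R code; the sample standard deviation is an assumption, and the age values are invented purely to exercise the function. The sex counts are those reported in Table 1.

```python
from collections import Counter
from statistics import mean, stdev


def summarize_continuous(values):
    """Format a continuous variable as 'mean ± SD' (sample SD assumed)."""
    return f"{mean(values):.2f} ± {stdev(values):.2f}"


def summarize_categorical(values):
    """Format each category as 'n (percent)' with two decimals."""
    counts = Counter(values)
    total = len(values)
    return {level: f"{n} ({100 * n / total:.2f})" for level, n in counts.items()}


# Counts taken from Table 1: 1514 males and 1306 females among 2820 patients.
sex = ["Male"] * 1514 + ["Female"] * 1306
print(summarize_categorical(sex))

# Invented ages, used only to show the output format.
print(summarize_continuous([60, 70, 80]))
```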

Feature Engineering and Data Balancing

For feature engineering, we first used Spearman correlation analysis to examine the relationships among data features. Because Spearman correlation captures monotonic relationships, it can reveal associations that are not strictly linear. The results were visualized as a correlation heat map. To prepare the data for the machine learning models, we then applied one-hot encoding to the categorical variables. This technique transforms each categorical variable into a set of binary indicator columns, one per category, so that algorithms can process categorical data without assuming an ordering among the categories.
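To make the two preprocessing steps concrete, the sketch below computes a Spearman correlation (the Pearson correlation of average ranks) and one-hot encodes a categorical variable. It is a plain-Python illustration under simple assumptions, not the study's R pipeline, and the input values are invented.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def one_hot(values):
    """Return the sorted category levels and one binary row per observation."""
    levels = sorted(set(values))
    return levels, [[1 if v == level else 0 for level in levels] for v in values]


# A monotonic but non-linear relationship still yields a correlation of 1.
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
print(one_hot(["I", "IV", "II", "I"]))
```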

We addressed class imbalance in the outcome status with the Synthetic Minority Over-sampling Technique (SMOTE), which synthesizes new samples from the minority class. It selects a random instance from the minority class, identifies its k-nearest neighbors, and generates synthetic instances along the line segments connecting the chosen point to its neighbors. This yields a more balanced class distribution and can improve the model’s capacity to learn patterns linked to the minority class.
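The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a simplified illustration that assumes purely numeric features and Euclidean distance; it is not the implementation used in the study, and the function name, parameters, and toy points are hypothetical.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the connecting segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Toy minority class: the four corners of the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority_points, n_new=10, k=2)
print(len(new_points))
```

Because each synthetic point lies on a segment between two real minority points, it always falls within the range already spanned by the minority class.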

Model Construction and Validation Approach

Eligible patients were randomly split into training and validation datasets in a 7:3 ratio. The training set was used to build the prognostic model and refine the risk assessment classification, while the validation set, kept distinct from the training data, was used to assess the model’s performance and the credibility of its predictions. To optimize the model’s precision and generalizability, the classification threshold was refined using 10-fold cross-validation within the training dataset, aiming to maximize the Area Under the Receiver Operating Characteristic Curve (AUC).
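The partitioning scheme described above can be sketched as follows. This is a minimal illustration assuming simple (unstratified) random allocation and an arbitrary seed; the source does not state whether the split was stratified, and the function names are hypothetical.

```python
import random


def train_validation_split(indices, train_fraction=0.7, seed=42):
    """Shuffle the indices and cut them into training and validation parts."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def k_folds(indices, k=10):
    """Yield (fit, holdout) index lists; every index is held out exactly once."""
    for fold in range(k):
        holdout = indices[fold::k]
        held = set(holdout)
        fit = [i for i in indices if i not in held]
        yield fit, holdout


# 2820 eligible patients, as reported in the study.
train, validation = train_validation_split(range(2820))
print(len(train), len(validation))

for fit, holdout in k_folds(train, k=10):
    pass  # fit a model on `fit` and score it on `holdout` here
```

Cross-validation folds are drawn only from the training indices, so the validation set plays no part in tuning.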

Candidate prognostic features were first screened by univariate logistic regression (P ≤ 0.01). The Lasso model was then used to select key prognostic features, reducing the impact of less relevant variables and isolating significant predictors (P < 0.05). Subsequently, logistic regression was employed to construct a traditional predictive model incorporating the determinants identified by Lasso analysis. In parallel, an AJCC staging model was developed based on the AJCC staging criteria.

The study then evaluated XGBoost, a gradient boosting framework, for predicting early mortality. XGBoost sequentially combines weak learners into a strong predictive ensemble, which makes it well suited to this task because it can handle high-dimensional data and capture complex relationships between prognostic features and patient outcomes. To enhance the XGBoost model’s performance, Bayesian optimization was employed for hyperparameter tuning within the machine learning workflow. This process, spanning 50 iterations and starting with an initial 15 evaluations, was configured to save predictions and workflows and to halt if no improvement was noted after 15 consecutive iterations, limiting unnecessary computation.
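The tuning budget described above (15 initial evaluations, up to 50 search iterations, and a stop after 15 iterations without improvement) can be sketched as a generic search loop. Two assumptions should be stated plainly: random sampling stands in for the Bayesian proposal step, which would instead choose each candidate from a surrogate model, and the initial evaluations are counted separately from the 50 iterations. The parameter names and the scoring function are hypothetical.

```python
import random


def tune(score, sample_params, n_initial=15, n_iter=50, patience=15, seed=0):
    """Search for the highest-scoring parameters with early stopping."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")

    # Initial design: evaluate n_initial configurations before searching.
    for _ in range(n_initial):
        params = sample_params(rng)
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s

    evaluations, stale = n_initial, 0
    for _ in range(n_iter):
        params = sample_params(rng)  # stand-in for a Bayesian proposal
        s = score(params)
        evaluations += 1
        if s > best_score:
            best_params, best_score, stale = params, s, 0
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` iterations
                break
    return best_params, best_score, evaluations


def sample_xgb_params(rng):
    """Hypothetical search space for two boosting hyperparameters."""
    return {"learning_rate": rng.uniform(0.01, 0.3), "max_depth": rng.randint(2, 8)}


def toy_score(params):
    """Invented objective standing in for a cross-validated AUC."""
    return -abs(params["learning_rate"] - 0.1)


best, best_value, n_evaluations = tune(toy_score, sample_xgb_params)
print(best, n_evaluations)
```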

Model Performance Evaluation

The evaluation of the model’s performance was conducted with thorough precision using a suite of established metrics, ensuring a comprehensive examination of its predictive capabilities. This in-depth analysis included Receiver Operating Characteristic (ROC) curve analysis, calibration curve analysis, Decision Curve Analysis (DCA), and the Kolmogorov-Smirnov (KS) statistic, each offering unique insights into the model’s effectiveness.

The ROC curve analysis, by calculating the AUC, served as a fundamental indicator of the model’s discriminative power. This metric underscores the model’s proficiency in differentiating between varying outcomes, representing a key aspect of prognostic accuracy. Calibration curve analysis further scrutinized the alignment between the model’s predicted probabilities and the actual observed outcomes, aiming for a model whose predictions closely mirror reality, as depicted by adherence to the 45° line in the calibration plot.
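Both quantities in this paragraph have simple definitions that a short sketch can make explicit. The AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, and a calibration curve compares the mean predicted probability with the observed event rate within bins of predictions. The code below is a plain-Python illustration with invented data; the equal-width binning is an assumption rather than the study's method.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_table(labels, probs, n_bins=5):
    """Return (mean predicted, observed rate, count) for each non-empty
    equal-width probability bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    return [
        (sum(p for _, p in b) / len(b), sum(y for y, _ in b) / len(b), len(b))
        for b in bins
        if b
    ]


y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, y_prob))
print(calibration_table(y_true, y_prob, n_bins=2))
```

A well-calibrated model produces bins whose mean predicted probability and observed rate are close, which corresponds to points near the 45° line.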

DCA provided a perspective on the model’s clinical utility by evaluating the net benefits across different probability thresholds, crucial for understanding the practical implications of the model’s clinical application. Lastly, the KS statistic emerged as a vital measure for assessing the disparity in the cumulative distribution functions of distinct samples. Within the context of machine learning and model assessment, a higher KS statistic indicates a more pronounced separation between the distributions of the positive and negative classes as predicted by the model, highlighting its capacity to accurately segregate outcomes.
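The two metrics above can be written out directly. Net benefit at a threshold probability p_t is TP/n - (FP/n) × p_t/(1 - p_t), and the KS statistic is the largest gap between the cumulative score distributions of the two outcome classes. The following is a minimal Python sketch with invented data, not the study's R code.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy in decision curve analysis: treat every patient."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


def ks_statistic(labels, scores):
    """Maximum distance between the empirical score CDFs of the two classes."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_neg - cdf_pos))
    return best


y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(net_benefit(y_true, y_prob, 0.2), net_benefit_treat_all(y_true, 0.2))
print(ks_statistic(y_true, y_prob))
```

A decision curve plots net benefit against the threshold for the model, for treating everyone, and for treating no one (net benefit 0); the model is useful over the range of thresholds where its curve is highest.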

Model Interpretation

The study leveraged SHAP (SHapley Additive exPlanations) to demystify the machine learning model, shedding light on the contribution of individual variables towards predictions and promoting a higher level of model transparency. SHAP effectively pinpoints influential features, thereby enriching the comprehension of how the model arrives at its predictions. This is crucial for making informed decisions and refining the model. The beeswarm summary plot, a distinctive element of SHAP, visually presents the influence of variables, offering a holistic understanding of their effect on the model’s outcomes. This visualization aids practitioners in identifying the key features that drive model predictions, contributing to more trustworthy and substantiated results in sophisticated machine learning projects.

To deepen the understanding of feature impact on model performance, the study also incorporated breakdown and partial dependence analyses. Breakdown analysis offers an interpretive view by dissecting a model’s prediction for a particular instance, detailing the contribution of each input feature. This analysis is invaluable for grasping how individual features influence the model’s predictions on a single-case basis, providing a granular view into the model’s reasoning process, particularly in complex models.
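The relationship between the two attribution methods can be illustrated on a toy model. A breakdown analysis attributes a prediction by switching features from a reference value to the patient's value in one fixed order, whereas Shapley values average those contributions over every possible order. The sketch below is a deliberate simplification: it uses a single reference row instead of averaging over a background dataset, and the risk function is invented, not the fitted XGBoost model.

```python
from itertools import permutations


def breakdown(predict, instance, baseline, order):
    """Contribution of each feature when it is switched from its baseline
    value to the instance value, following the given order."""
    current = dict(baseline)
    previous = predict(current)
    contributions = {}
    for name in order:
        current[name] = instance[name]
        value = predict(current)
        contributions[name] = value - previous
        previous = value
    return contributions


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: breakdown contributions averaged over all orders."""
    names = list(instance)
    orderings = list(permutations(names))
    totals = {name: 0.0 for name in names}
    for order in orderings:
        for name, c in breakdown(predict, instance, baseline, order).items():
            totals[name] += c
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_risk(f):
    """Hypothetical risk model with an interaction term."""
    return (0.1 + 0.3 * f["stage_iv"] + 0.2 * f["no_surgery"]
            + 0.1 * f["stage_iv"] * f["no_surgery"])


patient = {"stage_iv": 1, "no_surgery": 1}
reference = {"stage_iv": 0, "no_surgery": 0}

print(breakdown(toy_risk, patient, reference, ["stage_iv", "no_surgery"]))
print(shapley_values(toy_risk, patient, reference))
```

In both cases the contributions sum to the difference between the patient's prediction and the reference prediction; with an interaction present, the breakdown result depends on the chosen order while the Shapley values do not.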

Partial dependence analysis, in contrast, explores the effect of a selected feature on the predicted outcome while averaging over the observed values of all other features. This analysis is useful for interpreting complex models such as ensemble methods or neural networks, where input-output relationships can be convoluted. Partial Dependence Plots (PDPs) show how the model’s average prediction changes across a range of values for the feature of interest. Because they average over the other features, such plots describe a feature’s marginal contribution to predictions and can mask interaction effects with other features in the model.
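A partial dependence profile is computed by fixing the feature of interest at each grid value in every row of the data, predicting, and averaging. The sketch below shows that calculation in plain Python; the prediction function, feature names, and rows are invented for illustration and do not represent the study's model or data.

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value, set `feature` to that value in every row and
    return the average prediction."""
    profile = []
    for value in grid:
        predictions = [predict({**row, feature: value}) for row in rows]
        profile.append((value, sum(predictions) / len(predictions)))
    return profile


def toy_predict(row):
    """Hypothetical risk function used only to exercise the calculation."""
    return 0.001 * row["tumor_size_mm"] + 0.2 * row["stage_iv"]


toy_rows = [
    {"tumor_size_mm": 20, "stage_iv": 0},
    {"tumor_size_mm": 60, "stage_iv": 1},
]

for size, average_risk in partial_dependence(toy_predict, toy_rows, "tumor_size_mm", [10, 50]):
    print(size, round(average_risk, 3))
```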

Additionally, the study employed Restricted Cubic Splines (RCS) to explore potential non-linear effects of continuous features on patient prognosis, offering a more nuanced understanding of these features within the model.
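For readers unfamiliar with RCS, the basis can be sketched in its truncated-power form: with k knots it produces one linear column plus k - 2 cubic columns that are constrained to be linear beyond the last knot and are zero below the first. The knot locations below are hypothetical, since the source does not report them, and the code is an illustration rather than the study's R implementation.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for a single value x.

    With k knots the result has k - 1 entries: the linear term followed by
    k - 2 cubic terms that are linear beyond the last knot.
    """
    t = sorted(knots)
    k = len(t)
    span = t[-1] - t[-2]

    def cube(u):
        """Truncated cubic: u**3 when u is positive, otherwise 0."""
        return max(u, 0.0) ** 3

    columns = [x]
    for j in range(k - 2):
        columns.append(
            cube(x - t[j])
            - cube(x - t[-2]) * (t[-1] - t[j]) / span
            + cube(x - t[-1]) * (t[-2] - t[j]) / span
        )
    return columns


# Hypothetical knot locations for tumor size in millimeters.
example_knots = [10, 30, 50, 80]
print(rcs_basis(40.0, example_knots))
```

These columns would then enter a regression model as covariates, and a non-zero coefficient on any cubic column indicates departure from linearity.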

Statistical Analysis

Statistical analyses were carried out in R (version 4.2.1, available from the R Project), using two-tailed tests with the significance level set at P < 0.05. Model development relied on several R packages: “tidymodels” for streamlined modeling workflows, “glmnet” for regularized regression models including Lasso and Ridge regression, “tidyr” for data tidying, and “ggplot2” for data visualization. The interactive web-based model was built with the “shinydashboard” R package, which provides a dynamic and user-friendly interface for model interaction and visualization.

Results

Patient Characteristics

The study examined an initial cohort of 3185 patients diagnosed with ASC, adhering to the specified eligibility criteria. From this pool, 2820 individuals were carefully selected based on predefined inclusion parameters for detailed analysis. A flowchart outlining the study population screening, model development, and validation process is available in Supplemental Figure 1.

The cohort consisted predominantly of older individuals, with a mean age of 70.20 years. The observed early mortality rate was notably high, at 17.91%. The cohort was predominantly White (83.51%), with a slight majority of males (53.69%), and just over half of the patients were married (52.62%). Clinically, the primary tumors were most frequently found in the right lung (57.34%) and in the upper lobes (60.39%), with 32.20% located in the lower lobes. The mean tumor size was 40.43 millimeters, indicating a prevalence of relatively large tumors. Just over half of the cases (51.42%) were diagnosed at advanced stages (AJCC stage III or IV). Metastases were present in bone (13.51%), brain (8.51%), liver (5.14%), and lung (8.69%). Radiotherapy and chemotherapy were administered to 35.50% and 40.28% of the patients, respectively.

A detailed presentation of these baseline characteristics is encapsulated in Table 1, providing a comprehensive overview of the patient group under study. Additionally, Spearman correlation analysis conducted on the dataset’s features revealed weak multicollinearity, as visualized in Figure 1, suggesting minimal overlap in the information provided by the different variables in the study’s dataset.

Table 1.

Baseline Characteristics

Features	Total (n = 2820)
Age (years), mean ± SD	70.20 ± 10.17
Tumor size (mm), mean ± SD	40.43 ± 29.36
Sex, n (%)
Male	1514 (53.69)
Female	1306 (46.31)
Race, n (%)
White	2355 (83.51)
Black	257 (9.11)
Others	208 (7.38)
Marital status, n (%)
Married	1484 (52.62)
Unmarried	1217 (43.16)
Unknown	119 (4.22)
Primary tumor site, n (%)
Main bronchus	58 (2.06)
Upper lobe	1703 (60.39)
Middle lobe	116 (4.11)
Lower lobe	908 (32.20)
Overlapped lesions	35 (1.24)
Tumor grade, n (%)
Grade I	29 (1.03)
Grade II	672 (23.83)
Grade III	1313 (46.56)
Grade IV	38 (1.35)
Unknown	768 (27.23)
Tumor laterality, n (%)
Left	1203 (42.66)
Right	1617 (57.34)
AJCC stage, n (%)
I	974 (34.54)
II	396 (14.04)
III	567 (20.11)
IV	883 (31.31)
Bone metastasis, n(%)
Yes	381 (13.51)
No	2414 (85.60)
Unknown	25 (0.89)
Brain metastasis, n(%)
Yes	240 (8.51)
No	2553 (90.53)
Unknown	27 (0.96)
Liver metastasis, n(%)
Yes	145 (5.14)
No	2646 (93.83)
Unknown	29 (1.03)
Lung metastasis, n(%)
Yes	245 (8.69)
No	2545 (90.25)
Unknown	30 (1.06)
Surgery of primary site, n(%)
Yes	1396 (49.50)
No	1424 (50.50)
Surgery of other Site(s), n(%)
Yes	114 (4.04)
No	2706 (95.96)
Radiotherapy, n (%)
Yes	1001 (35.50)
None/unknown	1819 (64.50)
Chemotherapy, n(%)
Yes	1136 (40.28)
None/unknown	1684 (59.72)
Status, n(%)
Alive	2315 (82.09)
Dead	505 (17.91)

SD = Standard Deviation.

Figure 1.

Heat Map Illustrating Feature Correlations. SPS: Surgery of Primary Site; SOS: Surgery of Other site(s)

Predictive Feature Identification and Model Development

The study conducted a comprehensive examination of a wide array of features, including demographic details, tumor characteristics, and treatment approaches. These features were treated as covariates in both univariate logistic regression and Lasso analyses. Initial univariate logistic regression analysis, as shown in Table 2, excluded sex, race, tumor grade, tumor laterality, surgery of other site(s), and radiotherapy as standalone predictors of early mortality.

Table 2.

Univariate Logistic Regression Analysis

Features	Odds ratio	95% confidence interval	P-value
Age	0.98	0.97-0.99	<0.001
Tumor size	0.98	0.98-0.99	<0.001
Sex
Male(reference)
Female	1.2	0.99-1.46	0.06
Race
White(reference)
Black	1.07	0.76-1.5	0.7
Others	1.09	0.75-1.59	0.65
Marital status
Married(reference)
Unmarried	0.78	0.64-0.95	0.01
Unknown	2.18	1.12-4.22	0.02
Primary tumor site
Main bronchus(reference)
Upper lobe	3.75	2.2-6.41	<0.001
Middle lobe	4.73	2.26-9.93	<0.001
Lower lobe	3.27	1.89-5.63	<0.001
Overlapped lesions	3.03	1.14-8.06	0.03
Tumor grade
Grade I(reference)
Grade II	0	0-2.23	0.96
Grade III	0	0-9.38	0.96
Grade IV	0	0-8.64	0.96
Unknown	0	0-4.92	0.96
Tumor laterality
Left(reference)
Right	0.95	0.78-1.15	0.59
AJCC stage
I(reference)
II	0.5	0.32-0.79	<0.001
III	0.36	0.24-0.52	<0.001
IV	0.08	0.06-0.11	<0.001
Surgery of primary site
Yes(reference)
No	8.01	6.18-10.37	<0.001
Surgery of other Site(s)
Yes(reference)
No	1.6	1.04-2.46	0.03
Radiotherapy
Yes(reference)
None/unknown	1.01	0.82-1.23	0.94
Chemotherapy
Yes(reference)
None/unknown	0.36	0.29-0.45	<0.001
Bone metastasis
Yes(reference)
No	4.64	3.67-5.86	<0.001
Unknown	0.94	0.42-2.13	0.88
Brain metastasis
Yes(reference)
No	3.87	2.93-5.11	<0.001
Unknown	0.65	0.29-1.45	0.29
Liver metastasis
Yes(reference)
No	4.93	3.5-6.94	<0.001
Unknown	1	0.45-2.22	1
Lung metastasis
Yes(reference)
No	3.8	2.88-5.01	<0.001
Unknown	0.79	0.37-1.69	0.54

Subsequent Lasso analysis refined the feature selection process, identifying 6 critical features significantly influencing early mortality in ASC patients. These features were tumor size, AJCC stage, surgery of primary tumor site, chemotherapy, and metastases to the bone and liver. The selection process via the Lasso algorithm is illustrated in Figure 2.

Figure 2.

Features Screening Process Executed by the Lasso Algorithm. SPS: Surgery of Primary Site; SOS: Surgery of Other site(s)

Building on this foundation, the study developed a traditional prognostic model using logistic regression. To further improve the prediction of early mortality, an additional prognostic model was built using the XGBoost algorithm, which can handle complex, high-dimensional data and capture intricate relationships among prognostic features. Figure 3 provides an in-depth look at the XGBoost model’s training. This includes the hyperparameter tuning process (Figure 3A), outcomes from 10-fold cross-validation (Figure 3B), and the precision-recall curves (Figure 3C) generated during the model’s training phase.

Figure 3.

Training Process of the XGBoost Model. (A) Hyperparameter tuning; (B) 10-Fold Cross-validation; and (C) Precision-Recall curves

Discriminatory Ability of the Predictive Models

The discriminatory abilities of the predictive models were thoroughly assessed across both training and validation datasets, as visually represented in Figure 4A. The logistic regression model demonstrated a respectable ability to predict early mortality, achieving an AUC value of 0.809 (95% Confidence Interval [CI]: 0.794-0.824) within the training set, which slightly diminished to 0.787 (95% CI: 0.751-0.824) in the validation cohort. Compared to this, the AJCC staging model exhibited weaker discrimination capabilities, with an AUC of 0.750 (95% CI: 0.734-0.766) observed in the training dataset and a nearly identical performance in the validation set, where the AUC was 0.768 (95% CI: 0.730-0.81).

Figure 4.

Model Performance Evaluation: (A) AUCs; (B) DCAs; and (C) Calibration Plots for Training and Validation Datasets

In contrast, the XGBoost model showed markedly superior discriminatory power. It achieved an AUC of 0.967 (95% CI: 0.962-0.973) in the training cohort and maintained strong performance in the validation cohort, with an AUC of 0.844 (95% CI: 0.808-0.880). Its validation AUC exceeded those of the logistic regression (0.787) and AJCC staging (0.768) models, supporting the XGBoost model’s usefulness in predicting early mortality among patients with ASC.

Assessment on Calibration, DCA, and KS

The DCA, presented in Figure 4B, underscores the XGBoost model’s superior prognostic performance compared to both logistic regression and AJCC staging approaches across both cohorts. Calibration plots for the training and validation cohorts, depicted in Figure 4C, demonstrate a close alignment between predicted probabilities and actual outcomes, affirming the XGBoost model’s precision. The KS statistic further supports the model’s predictive capability by quantifying its ability to differentiate between patients who experienced early mortality and those who did not. With a KS score of 0.83 for the training cohort and 0.58 for the validation cohort, as depicted in Figure 5, the results demonstrate a pronounced separation between positive and negative outcomes. This underscores the XGBoost model’s strong discriminative power.

Figure 5.

Comparison of Prognostic Models for KS Score Across Both Training and Validation Datasets

Model Interpretation

To further understand how the XGBoost model predicts early mortality among patients with ASC, this study employed SHAP plots to elucidate the importance of different features within the model. Figure 6 shows a discernible pattern in which features with higher SHAP values correlate with an increased risk of adverse prognosis. The color coding within the plot (red for lower values, purple for values near the mean, and blue for higher values) provides additional insight into feature impact. Notably, this visual analysis underscores the significant influence of AJCC stage, chemotherapy, and tumor size on mortality risk, alongside other critical factors such as surgery of primary site and metastases to the bone and liver.

Figure 6.

Summary Plots of SHAP Values in the XGBoost Model. SPS: Surgery of Primary Site

To offer a deeper understanding of feature impact, a breakdown analysis was conducted on an individual case from the cohort. As illustrated in Figure 7, the analysis begins with the model’s baseline predictive value of 0.177. Certain features (the absence of metastases to the bone or liver, surgery of the primary site, a tumor size of 20 millimeters, and AJCC stage I) lowered the prediction, whereas not receiving chemotherapy raised it, culminating in a final early mortality prediction of 0.0454. This individualized breakdown aims to elucidate the significant predictors affecting patient outcomes.

Figure 7.

Breakdown Analysis Utilizing Features Derived From the First Member of the Cohort in the XGBoost Model. SPS: Surgery of Primary Site

Additionally, the study explored partial dependence profiles, presented in Figure 8, which examine the impact of both categorical (Figure 8A) and continuous (Figure 8B) features on the model’s predictions. By isolating the effects of individual features, the analysis identifies chemotherapy and AJCC stage as significant predictors of the model’s output. It also highlights that larger tumor sizes are associated with higher predicted risks. These insights collectively enhance our understanding of the features driving the model’s prognostic predictions, emphasizing the nuanced roles of various clinical and demographic factors in determining patient outcomes.

Figure 8.

Partial Dependence Profiles Showing the Influence of (A) Categorical and (B) Continuous Features on Model Performance. SPS: Surgery of Primary Site

RCS Analysis on Tumor Size and Mortality

In the RCS analysis of tumor size and early mortality (Figure 9), a clear non-linear relationship emerged. The hazard ratio decreases sharply as tumor size increases from very small dimensions, then stabilizes or slightly rises again around the 44 mm cutoff. This pattern indicates that tumor size exerts a significant, non-linear influence on early mortality risk, underscoring the clinical importance of recognizing size thresholds—particularly around 44 mm—when stratifying risk and guiding treatment decisions in patients with ASC.

Figure 9.

Restricted Cubic Splines Examining the Relationship Between the Tumor Size and Survival Outcome

Development of a Predictive System on a Web Server

Building on the XGBoost model, this study developed a web-based application to facilitate the predictive assessment of early mortality among patients with ASC. The platform is designed to accommodate researchers, including those without extensive experience in machine learning, by offering a streamlined and user-friendly system for initializing, training, and evaluating an XGBoost model.

The application features an intuitive interface, making it straightforward for users to input patient data and receive predictive outcomes. A visual guide to the application’s interface is provided in Supplemental Figure 2. For direct access to this predictive tool, users can visit the following URL: https://the-lungcare-innovators-research-team.shinyapps.io/early-mortality-predictor-asc/. This web-based application serves as a valuable resource for researchers focused on exploring prognostic factors influencing survival rates in patients with ASC, enhancing the accessibility and applicability of advanced predictive modeling in clinical research.

Discussion

This study analyzed early mortality risk in 2820 patients diagnosed with ASC. The XGBoost model outperformed traditional methods, such as logistic regression and AJCC staging, in predicting early mortality among these patients. Notably, the beeswarm summary plot indicated that AJCC stage was the most influential risk feature, followed by chemotherapy, tumor size, surgery of primary site, and metastases to bone and liver. Building on the XGBoost model, the research team developed a web-based prognostic tool designed to offer clinicians personalized and actionable insights that may improve patient management. This initiative advances the use of machine learning-driven models for prognostication in patients with ASC and addresses a gap in the existing medical literature.

The AJCC staging system, while widely utilized for prognosticating ASC, often does not fully capture the disease’s metastatic potential or the variability in treatment response. Its primary focus on the anatomical extent of tumors overlooks critical biological and molecular factors that could significantly influence prognosis and the efficacy of treatments.⁸,⁹ This limitation underscores the need for more sophisticated predictive tools that integrate a wider array of clinical data to improve the precision of ASC prognostication. Machine learning is well placed to meet this need, as it can analyze diverse data types (including clinical attributes, genetic markers, and imaging data) to uncover complex patterns. Such analysis facilitates more accurate prognosis predictions, enables the formulation of personalized treatment strategies, and contributes to the identification of novel therapeutic targets, ultimately enhancing patient outcomes and advancing cancer research.¹⁰ However, despite the promise shown by machine learning and other advanced analytical methods, models specifically tailored to predicting early mortality in patients with ASC remain lacking. This gap points to a need for continued efforts in model development and validation, aimed at optimizing the predictive accuracy and clinical utility of prognostic tools for this complex and heterogeneous disease.

Several studies have examined survival outcomes in patients with ASC and the factors that influence prognosis. For instance, Liang et al¹¹ identified age, sex, primary tumor site, histological grade, TNM stage, surgery, and chemotherapy as independent factors affecting ASC outcomes, with their model achieving concordance indices (C-indices) of 0.75. Similarly, Wu et al¹² found that increasing age, male sex, absence of surgery, and advanced TNM stage contributed to poorer outcomes, with their model reaching 3-year and 5-year C-indices of 0.79. Wang et al¹³ identified increasing age, male sex, invasion through the visceral pleura, poor differentiation, and higher stage as independent risk factors for the prognosis of surgical patients. Shi et al¹⁴ identified 4 clinical parameters (primary tumor site, AJCC stage, T stage, and surgery) as predictors of patient outcomes. However, the studies by Wang and Shi did not produce a predictive model, possibly because of limitations in the available data features. While the existing models offer valuable insights, they target the broader ASC patient population and do not provide personalized predictions of early mortality. In contrast, our study focuses specifically on early mortality among patients with ASC. By drawing on SEER data collected across multiple geographic regions and applying machine learning techniques, our model aims to improve the precision of early mortality prediction in this cohort and may be applicable to future prognostic assessments.

The prognostic evaluation of ASC patients in this study highlighted 6 independent predictive features identified by the Lasso algorithm, corroborating findings from prior research.^12,15 With the rise of targeted therapies in oncology, the mutation status of key driver oncogenes has drawn increasing attention in ASC because of its potential prognostic relevance. For instance, activating mutations in the epidermal growth factor receptor (EGFR) gene have been observed in approximately 30%-50% of ASC cases. Despite this prevalence, the impact of EGFR mutations on ASC prognosis remains debated, and their specific prognostic value has yet to be thoroughly investigated.^16-18 Similarly, Kirsten rat sarcoma viral oncogene homolog (KRAS) mutations have been identified in about 5%-10% of cases, according to several studies.^19,20 However, there is scant evidence linking KRAS mutations directly to ASC patient prognosis.¹¹ In addition to genetic alterations, the tumor microenvironment (TME) in non-small cell lung cancer (NSCLC) functions as a dynamic ecosystem in which various immunosuppressive mechanisms converge to promote immune evasion and resistance to therapy. Despite subtype-specific differences, common processes such as immune cell dysfunction, metabolic reprogramming, and structural barriers collectively contribute to tumor progression. These interactions underscore the role of the TME in shaping disease behavior and therapeutic response.^21,22 Nevertheless, the absence of molecular and genetic profiling in the current database restricts our ability to evaluate the influence of these factors on ASC prognosis. Future research that combines clinical characteristics with genomic data may further improve the predictive performance of the model.

In addition, the RCS analysis revealed a notable non-linear relationship between tumor size and early mortality, with a threshold at approximately 44 mm. Tumors below this size tended to show a lower hazard ratio, while those exceeding it exhibited an elevated risk of mortality. This finding underscores the potential importance of tumor size as a key prognostic indicator and highlights the limitations of relying solely on traditional linear assumptions. Incorporating this non-linear effect into clinical decision-making may significantly refine risk stratification and optimize treatment strategies, ultimately improving patient outcomes in ASC. However, given the rarity of ASC, robust evidence supporting a precise tumor size threshold is still lacking. Therefore, caution is warranted when interpreting the non-linear impact of tumor size on prognosis, and further studies with larger datasets are needed to confirm these observations.
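For readers less familiar with restricted cubic splines, the following minimal Python sketch shows one common construction, the truncated power basis constrained to be linear in both tails. It illustrates how a non-linear tumor size effect can be represented in a regression model; it is not the implementation used in this study. The function name `rcs_basis` and the knot positions are hypothetical and are not taken from our analysis.

```python
def rcs_basis(x, knots):
    """Return [x, s_1(x), ..., s_{k-2}(x)] for a restricted cubic spline.

    Uses the truncated power basis, which is linear below the first knot
    and above the last knot. With k knots it yields k - 1 columns.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def pos3(u):
        # Truncated cubic: (u)_+^3
        return u ** 3 if u > 0 else 0.0

    span = t[-1] - t[-2]
    scale = (t[-1] - t[0]) ** 2  # conventional normalization for numerical stability
    basis = [float(x)]
    for j in range(k - 2):
        term = (
            pos3(x - t[j])
            - pos3(x - t[-2]) * (t[-1] - t[j]) / span
            + pos3(x - t[-1]) * (t[-2] - t[j]) / span
        )
        basis.append(term / scale)
    return basis


# Hypothetical knot positions in mm, chosen only for illustration.
KNOTS_MM = [10, 30, 50, 80]

# Each tumor size maps to a row of spline terms that could enter a regression model.
design_rows = [rcs_basis(size_mm, KNOTS_MM) for size_mm in (12, 44, 95)]
```

In such a model, the fitted coefficients on the spline terms determine where the estimated risk curve bends, so any threshold read from the curve depends on the data and on the knot placement.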

To assess the robustness of our predictive model, we used 10-fold cross-validation, which helps limit overfitting during model development and provides an estimate of the model’s generalizability across patient groups. Calibration curves further supported the reliability of the XGBoost model, showing close agreement between predicted probabilities and observed outcomes.
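The two procedures described above can be sketched in plain Python as follows. This is an illustrative outline under simplifying assumptions (unstratified folds, equal-width probability bins), not the code used in this study; the function names `k_fold_indices` and `calibration_bins` are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, val


def calibration_bins(y_true, y_prob, n_bins=10):
    """Group predictions into equal-width bins.

    Returns (mean predicted probability, observed event rate, count)
    for every non-empty bin; these pairs are the points of a calibration curve.
    """
    sums = [[0.0, 0, 0] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[b][0] += p
        sums[b][1] += y
        sums[b][2] += 1
    return [(s[0] / s[2], s[1] / s[2], s[2]) for s in sums if s[2] > 0]
```

A well-calibrated model produces bins in which the mean predicted probability and the observed event rate are close, so the points lie near the diagonal of the calibration plot.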

The model’s validity is also supported by its KS score, which was higher than that of the other models and indicates better separation between patient outcomes. In terms of clinical utility, DCA indicated that our model offers greater net benefit than conventional models in both the training and validation cohorts. These results suggest that the model has potential for clinical implementation and could help refine decision-making and patient care.
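As a brief illustration of the two metrics discussed above, the sketch below computes the KS statistic (the maximum gap between the cumulative score distributions of patients with and without the event) and the net benefit used in DCA (true positives minus false positives weighted by the odds of the threshold probability, per patient). The function names are hypothetical, and the sketch follows the standard textbook definitions rather than reproducing the code used in this study.

```python
from bisect import bisect_right


def ks_statistic(y_true, y_prob):
    """Maximum distance between the score CDFs of events and non-events."""
    pos = sorted(p for y, p in zip(y_true, y_prob) if y == 1)
    neg = sorted(p for y, p in zip(y_true, y_prob) if y == 0)
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")
    best = 0.0
    for cutoff in sorted(set(y_prob)):
        cdf_pos = bisect_right(pos, cutoff) / len(pos)
        cdf_neg = bisect_right(neg, cutoff) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on patients whose predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)
```

A decision curve is obtained by evaluating `net_benefit` over a range of thresholds and comparing the result with the "treat all" and "treat none" strategies.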

Although this XGBoost model, tailored to a specific patient subset, shows promising predictive capability, several limitations must be acknowledged. First, the retrospective nature of the study introduces the potential for selection bias. Second, although the SEER database provides a large and comprehensive dataset, it lacks important clinical features such as smoking status, specific treatment regimens, laboratory values, comorbidities, and genetic profiles. Lastly, although Lasso regularization and cross-validation were applied to reduce the risk of overfitting, external validation in independent cohorts is still necessary to confirm the model’s generalizability.

Conclusions

This study presents a novel XGBoost-based model for predicting early mortality in patients with ASC, offering clinicians a robust and practical tool to inform personalized treatment decisions and optimize follow-up strategies. By utilizing machine learning to address a clinically challenging population, our model represents a meaningful step toward precision oncology for ASC. Future work will focus on expanding the model through multicenter validation, integrating molecular and tumor-related biomarkers, and updating the framework to further improve its utility and generalizability in real-world clinical settings.

Supplemental Material

Supplemental Material - Integrating Machine Learning for Early Mortality Prediction in Lung Adenosquamous Carcinoma: A Web-Based Prognostic Model

Supplemental Material for Integrating Machine Learning for Early Mortality Prediction in Lung Adenosquamous Carcinoma: A Web-Based Prognostic Model by Min Liang, Xiaocai Li, Shangyu Xie, Xiaoying Huang, and Shifan Tan in Cancer Control


Footnotes

Acknowledgments

The authors thank Mrs. Yunru Fan and Dr Alexandra Lam for coordinating and supporting the development and preparation of the manuscript.

ORCID iD

Min Liang

Ethical Consideration

This study is based on open-access databases and does not involve new research with human participants or animals. It was reviewed and exempted by the Medical Ethics Committee of Maoming People’s Hospital.

Informed Consent

Informed consent was waived as the research utilized publicly available, de-identified data.

Author Contributions

ML and SF.T participated in the initial conception of the study design. ML, SY.X and XY.H participated actively in the data collection and analysis. ML and XC.L contributed to the interpretation of the results. ML drafted the article and all other authors made critical revisions, introducing important intellectual content. All authors have read and agreed to the published version of the manuscript.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The funding was provided by the High-level Hospital Construction Project of Maoming People’s Hospital, Science and Technology Innovation Development Program of Maoming City (2024kjcxLX056), the Medical Research Fund of Guangdong Province (A2024528), the Research Project of Maoming Science and Technology Bureau (Grant No. 2021121), and the Outstanding Young Talents Program of Maoming People’s Hospital (SY2021021).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets generated and analyzed during the current study are available in the Surveillance, Epidemiology, and End Results (SEER) repository [].

Supplemental Material

Supplemental material for this article is available online.

Appendix

References

1. Bray, Laversanne, Sung, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2024;74:229-263. doi:10.3322/caac.21834

2. Nicholson, Tsao, Beasley, et al. The 2021 WHO classification of lung tumors: impact of advances since 2015. J Thorac Oncol. 2022;17:362-387. doi:10.1016/j.jtho.2021.11.003

3. Cooke, Nguyen, Yang, Chen, Calhoun. Survival comparison of adenosquamous, squamous cell, and adenocarcinoma of the lung after lobectomy. Ann Thorac Surg. 2010;90:943-948. doi:10.1016/j.athoracsur.2010.05.025

4. Filosso, Ruffini, Asioli, et al. Adenosquamous lung carcinomas: a histologic subtype with poor prognosis. Lung Cancer. 2011;74:25-29. doi:10.1016/j.lungcan.2011.01.030

5. Fan, Xue, et al. iMLGAM: integrated Machine Learning and Genetic Algorithm-driven Multiomics analysis for pan-cancer immunotherapy response prediction. Imeta. 2025;4:e70011. doi:10.1002/imt2.70011

6. Terranova, Venkatakrishnan. Machine learning in modeling disease trajectory and treatment outcomes: an emerging enabler for model-informed precision medicine. Clin Pharmacol Ther. 2023;115:720-726. doi:10.1002/cpt.3153

7. Collins, Reitsma, Altman, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350:g7594. doi:10.1136/bmj.g7594

8. Sobin, Gospodarowicz, Wittekind. TNM Classification of Malignant Tumours. John Wiley & Sons; 2011.

9. Liang, Chen, Singh, et al. Prognostic nomogram for overall survival in small cell lung cancer patients treated with chemotherapy: a SEER-based retrospective cohort study. Adv Ther. 2022;39:346-359. doi:10.1007/s12325-021-01974-6

10. Kourou, Exarchos, Karamouzis, Fotiadis. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2015;13:8-17. doi:10.1016/j.csbj.2014.11.005

11. Liang, Sui, Zheng, et al. A nomogram to predict prognosis of patients with lung adenosquamous carcinoma: a population-based study. J Thorac Dis. 2020;12:2288-2303. doi:10.21037/jtd.2020.03.115

12. Petersen, et al. A competing risk nomogram predicting cause-specific mortality in patients with lung adenosquamous carcinoma. BMC Cancer. 2020;20:429. doi:10.1186/s12885-020-06927-w

13. Wang, Zhou, Wang, et al. Clinicopathological characteristics and prognosis of resectable lung adenosquamous carcinoma: a population-based study of the SEER database. Jpn J Clin Oncol. 2022;52:1191-1200. doi:10.1093/jjco/hyac096

14. Shi, Shao, Zhang, Tao. Tumor location and survival outcomes in lung adenosquamous carcinoma: a propensity score matched analysis. Med Sci Monit. 2020;26:e922138. doi:10.12659/msm.922138

15. Xie. Metastatic pattern and prognosis in patients with lung adenosquamous carcinoma: a surveillance, epidemiology, and end results-based population study. Heliyon. 2024;10:e30641. doi:10.1016/j.heliyon.2024.e30641

16. Morodomi, Okamoto, Takenoyama, et al. Clinical significance of detecting somatic gene mutations in surgically resected adenosquamous cell carcinoma of the lung in Japanese patients. Ann Surg Oncol. 2015;22:2593-2598. doi:10.1245/s10434-014-4218-0

17. Zheng, Fan, Liu. Risk factors of postoperative recurrence and potential candidate of adjuvant radiotherapy in lung adenosquamous carcinoma. J Thorac Dis. 2020;12:5593-5602. doi:10.21037/jtd-20-1979

18. Song, Lin, Shao, Zhang. Therapeutic efficacy of gefitinib and erlotinib in patients with advanced lung adenosquamous carcinoma. J Chin Med Assoc. 2013;76:481-485. doi:10.1016/j.jcma.2013.05.007

19. Tochigi, Dacic, Nikiforova, Cieply, Yousem. Adenosquamous carcinoma of the lung: a microdissection study of KRAS and EGFR mutational and amplification status in a western patient population. Am J Clin Pathol. 2011;135:783-789. doi:10.1309/ajcp08iqzaogylfl

20. Jia, Chen. EGFR and KRAS mutations in Chinese patients with adenosquamous carcinoma of the lung. Lung Cancer. 2011;74:396-400. doi:10.1016/j.lungcan.2011.04.005

21. Bagaev, Kotlov, Nomie, et al. Conserved pan-cancer microenvironment subtypes predict response to immunotherapy. Cancer Cell. 2021;39:845-865.e847. doi:10.1016/j.ccell.2021.04.014

22. Zhang, Yang, Liu, et al. Deciphering lung adenocarcinoma evolution: integrative single-cell genomics identifies the prognostic lung progression associated signature. J Cell Mol Med. 2024;28:e18408. doi:10.1111/jcmm.18408
