Abstract
Introduction
Accurate machine learning-based prognostic models for the diagnosis and treatment of extensive-stage small cell lung cancer (ES-SCLC) are currently lacking, and the role of radiotherapy in ES-SCLC remains a subject of ongoing debate.
Methods
This study used data from the Surveillance, Epidemiology, and End Results (SEER) database of patients diagnosed with ES-SCLC. Cox regression analysis was performed to identify the key prognostic factors. Six machine learning models were developed: XGBoost, support vector machine, k-nearest neighbors, random forest, Iterative Dichotomiser 3, and logistic regression. External validation was conducted using the medical records of ES-SCLC patients who met the screening criteria at a local hospital. Propensity score matching was applied to address baseline imbalance. Kaplan–Meier (K-M) survival analysis was used to evaluate the prognostic impact of radiotherapy, followed by stratified K-M analysis to further explore its applicability across subgroups.
Results
The analysis revealed that radiotherapy, chemotherapy, and liver metastasis were significantly associated with prognosis (P < .001). Liver metastasis was an independent risk factor for poor survival. The stratified K-M analysis suggested that radiotherapy may benefit certain patient subgroups.
Conclusion
This study provides novel insights into radiotherapy indications for ES-SCLC, contributing to improved clinical guidelines and treatment strategies based on machine learning-derived prognostic models.
Introduction
Small cell lung cancer (SCLC) is a highly aggressive neuroendocrine cancer, 1 accounting for approximately 20%-25% of all bronchogenic tumors. 2 It is most often characterized by high proliferation and metastasis rates, poor prognosis, and a high degree of sensitivity to chemoradiotherapy. SCLC is classified as either limited or extensive stage. At the time of diagnosis, more than 60% of patients have already entered the extensive stage, 3 which is associated with a very poor prognosis, a significantly lower survival rate than the limited stage, and a 2-year survival rate of only 7%-8%. 4 Patients with extensive-stage small cell lung cancer (ES-SCLC) are often deeply concerned about their survival; therefore, clinicians need to estimate prognosis to guide treatment planning. Evaluating the prognostic factors and constructing an accurate prediction model are crucial.
Predictive models can not only estimate prognosis, but also help explore clinical characteristics related to ES-SCLC outcomes. Although previous studies have established nomograms for predicting ES-SCLC prognosis, 5 machine learning has the potential to improve prediction accuracy compared to traditional methods.
Artificial intelligence (AI) provides valuable support for clinical decision-making. For instance, Meng et al. developed a machine learning model enabling the early diagnosis and treatment of lung cancer, which may ultimately reduce lung cancer-related mortality. 6 Similarly, Gao et al. built an artificial neural network that effectively predicted distant metastasis in lung cancer. 7 Therefore, machine learning can play a key role in the development of prognostic models. As a multidimensional data tool, 8 AI enables the development of more accurate prediction models than traditional nomograms. However, machine learning has not been widely applied to predict the prognosis of patients with ES-SCLC.
Furthermore, the indications for radiotherapy in patients with ES-SCLC have not been thoroughly analyzed in previous studies. 9 Therefore, this study aimed to analyze radiotherapy indications in ES-SCLC to improve clinical diagnosis and treatment strategies. Using real-world data from Asian patients in the Surveillance, Epidemiology, and End Results (SEER) database, we analyzed the prognostic factors and established a high-accuracy machine learning model to predict 1-, 2-, and 3-year survival rates. This study contributes to the integration of AI into clinical practice and holds promise for improving clinical decision-making and patient outcomes.
Materials and Methods
Data Sources and Study Design
Patient data recorded in the SEER database from 2010 to 2015 were reviewed and rigorously screened (Figure 1). The inclusion criteria were as follows: (1) SCLC as the only primary tumor; (2) histopathological and morphological evidence of SCLC in accordance with the International Classification of Diseases for Oncology, Third Edition; (3) presence of distant metastases; and (4) a clearly defined pathological type of SCLC. The exclusion criteria were as follows: (1) two or more primary tumors; (2) limited-stage SCLC; (3) no clear survival data; and (4) incomplete data. Handling of missing data: variables with a high missing rate (e.g., “CS tumor size,” with >40% missing) were excluded because of their potential impact on model reliability and training outcomes.
The Technology Roadmap for this Study.
Univariate and multivariate Cox analyses of the clinical features of the patients obtained from the SEER data were performed. Statistically significant features from the multivariate Cox analyses were incorporated into six machine learning models: XGBoost, random forest, k-nearest neighbor (KNN), Iterative Dichotomiser 3 (ID3), support vector machine (SVM), and logistic regression (LR) to predict the overall survival (OS) of patients with ES-SCLC at 1, 2, and 3 years. Before training, survival data were collected as the response variable. The patients were then randomized into training and test groups in a 7:3 ratio. In this study, the receiver operating characteristic (ROC) curve and area under the ROC curve (AUC) were used to evaluate the models, and the AUC values of each of the six models for the test data were compared. Subsequently, we employed 1:1 propensity score matching (PSM) to adjust for imbalances in clinical characteristics between patients who received radiotherapy and those who did not and performed Kaplan–Meier (K-M) survival and stratified analyses.
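The step from survival data to a binary response variable, followed by the 7:3 split, can be illustrated with a minimal sketch. It is not the authors' code: the function names, the random seed, and the rule that a patient censored before the horizon receives no label are assumptions made for illustration only.

```python
import random

def label_at_horizon(survival_months, died, horizon_months):
    """Binary outcome at a fixed horizon.

    Returns 1 if the patient survived beyond the horizon, 0 if death was
    observed at or before it, and None if follow-up ended (censored) before
    the horizon, in which case the outcome is unknown (assumed handling).
    """
    if survival_months > horizon_months:
        return 1
    if died:
        return 0
    return None

def split_train_test(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split them, 7:3 by default."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical patients: (survival in months, death observed?)
patients = [(5, True), (14, True), (30, True), (40, False), (8, False)]
labels_1y = [label_at_horizon(m, d, 12) for m, d in patients]
labels_3y = [label_at_horizon(m, d, 36) for m, d in patients]

train, test = split_train_test(range(10))
```

Fixing the seed keeps the split reproducible, which matters when the AUC values of several models are compared on the same test set.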
Machine learning models were constructed using Python (version 3.11.5), XGBoost 2.0.3, and scikit-learn 1.3.0. Multivariate Cox regression model analyses, PSM, and survival analyses were performed using R software (version 4.3.0). Statistical significance was defined as P < .05. This study was reported in accordance with the TRIPOD guidelines. 10
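As a rough illustration of 1:1 PSM, the sketch below pairs each treated patient with the nearest unmatched control on the propensity score. It is written in Python for consistency with the other examples and is not the R workflow used in the study. The greedy nearest-neighbor rule, the caliper of 0.05, and the assumption that propensity scores have already been estimated (for example, by logistic regression) are all illustrative choices that the text does not specify.

```python
def match_one_to_one(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: dicts mapping a patient id to a propensity score.
    A pair is kept only if the score difference is within the caliper.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)  # controls not yet used in a pair
    pairs = []
    # Process treated patients in score order so the result is deterministic.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs

# Hypothetical propensity scores for radiotherapy (T*) and no radiotherapy (C*).
treated = {"T1": 0.30, "T2": 0.62, "T3": 0.90}
controls = {"C1": 0.28, "C2": 0.33, "C3": 0.60, "C4": 0.10}
pairs = match_one_to_one(treated, controls)
```

In this toy example T3 remains unmatched because no control lies within the caliper, which is how matching trades sample size for balance.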
Model Validation
To further validate the accuracy of the XGBoost prognostic model, we collected data from 34 patients diagnosed with ES-SCLC between September 2018 and September 2023 at the Weifang Hospital of Traditional Chinese Medicine. The exclusion criteria were as follows: (1) multiple primary tumors, (2) limited-stage SCLC, (3) loss to follow-up, (4) lack of a clear date of death, and (5) incomplete clinical data. This retrospective cohort study was approved by the Institutional Review Board of the Medical Research Ethics Committee of Weifang Traditional Chinese Medicine Hospital (ethical approval number: 2025YX391). Informed consent was obtained from all the patients in the expanded cohort.
Machine-Learning Models
XGBoost, random forest, KNN, ID3, SVM, and LR are widely used machine learning models for binary classification tasks. Among them, XGBoost has exhibited the best performance. It approximates the loss function using a second-order Taylor expansion and optimizes the model using both the first-order and second-order derivatives of the loss function. To build decision trees, XGBoost uses a greedy strategy to select optimal split points and features, thereby minimizing the loss function. To prevent overfitting and enhance the generalization ability of the model, several techniques are employed, including regularization, learning rate adjustment, column sampling, and approximation of the optimal split points. Regularization adds a term to the loss function, serving as pre-pruning to control the model complexity. The learning rate determines the contribution of each tree to the final model, with smaller rates often leading to better generalization. Column sampling involves stratified and random feature sampling before constructing the tree, thereby increasing diversity and reducing overfitting. Approximate split-finding algorithms improve computational efficiency by evaluating a reduced set of candidate split points rather than every possible one. By combining these strategies, XGBoost demonstrates robust performance in practical applications, making it a powerful tool for machine learning.
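The greedy, regularized split selection described above can be sketched in a few lines of plain Python. This is a simplified illustration of the standard second-order split gain for the logistic loss, not the XGBoost library's implementation. The single feature (age), the starting raw scores of zero, and the parameter values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_hess(y_true, raw_score):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)

def leaf_score(g_sum, h_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) of a leaf."""
    return g_sum * g_sum / (h_sum + reg_lambda)

def best_split(values, labels, raw_scores, reg_lambda=1.0, gamma=0.0):
    """Greedy scan over thresholds of one feature; returns (threshold, gain).

    gain = 0.5 * (score_left + score_right - score_parent) - gamma.
    Returns (None, 0.0) when no split has positive gain.
    """
    stats = [grad_hess(y, s) for y, s in zip(labels, raw_scores)]
    order = sorted(range(len(values)), key=lambda i: values[i])
    g_total = sum(g for g, _ in stats)
    h_total = sum(h for _, h in stats)
    best = (None, 0.0)
    g_left = h_left = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += stats[i][0]
        h_left += stats[i][1]
        if values[i] == values[nxt]:
            continue  # no threshold separates equal values
        gain = 0.5 * (leaf_score(g_left, h_left, reg_lambda)
                      + leaf_score(g_total - g_left, h_total - h_left, reg_lambda)
                      - leaf_score(g_total, h_total, reg_lambda)) - gamma
        if gain > best[1]:
            best = ((values[i] + values[nxt]) / 2.0, gain)
    return best

# Hypothetical data: age in years and 1-year survival label (1 = alive).
ages = [45, 52, 61, 70, 78, 83]
alive = [1, 1, 1, 0, 0, 0]
threshold, gain = best_split(ages, alive, raw_scores=[0.0] * 6)
```

Raising `reg_lambda` shrinks every leaf score, and raising `gamma` sets a minimum gain a split must reach, so both act as the pre-pruning controls mentioned in the text.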
The XGBoost algorithm operates as follows (where feature vector corresponds to category):
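For orientation, the standard XGBoost formulation is given below as a sketch. The notation is assumed rather than taken from this study: $x_i$ is the feature vector of patient $i$, $y_i \in \{0, 1\}$ its category, $\hat{y}_i^{(t-1)}$ the prediction after $t-1$ trees, $f_t$ the tree added at round $t$, $T$ the number of leaves, $w_j$ the weight of leaf $j$, and $\lambda$ and $\gamma$ the regularization parameters.

```latex
\begin{aligned}
\mathcal{L}^{(t)} &= \sum_{i=1}^{n} l\bigl(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}, \\
\mathcal{L}^{(t)} &\approx \sum_{i=1}^{n} \Bigl[\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^{2} \Bigr] + \Omega(f_t), \\
g_i &= \frac{\partial\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^{2} l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial\, \bigl(\hat{y}_i^{(t-1)}\bigr)^{2}}, \\
w_j^{*} &= -\frac{G_j}{H_j + \lambda},
\qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i, \\
\text{Gain} &= \frac{1}{2}\left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma .
\end{aligned}
```

Here $I_j$ is the set of patients assigned to leaf $j$, and the subscripts $L$ and $R$ denote the left and right children of a candidate split. The split with the largest gain is chosen greedily at each node.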
Results
Clinical Characteristics of Patients with ES-SCLC
Baseline Characteristics of Patients With ES-SCLC in the SEER Cohort.
Univariate and Multivariate Cox Regression Analysis
Univariate and Multivariate Cox Analyses of Extracted Features From the SEER Database.
n = 2065, events = 2065, likelihood ratio test = 757.94 on 8 df (P < .001).
Building and Evaluating Predictive Models
Evaluating the Prognosis of Patients With ES-SCLC
XGBoost Hyperparameter Grid Search Space.
Main Parameters of the XGBoost Model.

Evaluation of the XGBoost Model. A, B, and C Represent the ROC Curves for the Validation set for Years 1, 2, and 3, Respectively; D, E, and F Represent the ROC Curves for the Training set for Years 1, 2, and 3, Respectively.
Performance of Predictive Models Built Using Machine Learning Algorithms on the Test Dataset (Area Under the ROC Curve).
To further test the efficacy of the XGBoost model developed in this study, we collected clinical data (including prognostic information) from 34 patients with ES-SCLC at our hospital. We found that the XGBoost model still demonstrated good accuracy in the external independent data validation (1-year AUC = 0.71, 2-year AUC = 0.78, and 3-year AUC = 0.89) (Figure 3).
XGBoost Model Validation Against the External Data. A, B, and C Show the ROC Curves for Years 1, 2, and 3 of Externally Validated Medical Records, Respectively.
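An AUC can be read as the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor. The sketch below computes it directly from that definition. It is an illustration with hypothetical labels and scores, not the evaluation code used in the study, and the pairwise loop is only practical for small cohorts such as an external set of a few dozen patients.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for the negative class.
    scores: predicted probabilities or any monotone risk score.
    Tied pairs count as half a correct ranking.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical survival labels and predicted probabilities.
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.7, 0.6, 0.2, 0.4, 0.4]
auc = roc_auc(y_true, y_score)
```

With few patients, each pair carries a large share of the estimate, so AUC values from small validation sets should be interpreted with corresponding caution.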
In addition, we analyzed the importance of the features in the model. For 1-year survival (Figure 4A), the top five prognostic factors were chemotherapy, radiotherapy, liver metastasis, age, and brain metastasis. For 2-year (Figure 4B) and 3-year (Figure 4C) survival, the top five prognostic factors were liver metastasis, chemotherapy, age, sex, and radiotherapy. Thus, chemotherapy was the most important factor in the 1-year prognostic model, followed by radiotherapy, whereas liver metastasis and chemotherapy were the most important in the 2- and 3-year prognostic models.
Importance Ranking of Features in the XGBoost Model. A, B, and C Show the Importance Ranking of Features in 1, 2, and 3 Years, Respectively.
Post-PSM Stratification Analysis
Comparison of Characteristics of Patients on Radiotherapy Before and After PSM Adjustment.
In the PSM-adjusted dataset, the K-M survival curves (Figure 5; P < .001) indicated that radiotherapy improved patient prognosis and survival. Stratified analysis based on multivariable Cox proportional hazards regression yielded the following results (Figure 6). In terms of treatment, patients who received chemotherapy had the best short- and long-term prognoses. Regarding the site of distant metastasis, short- and long-term prognoses were better when there was no metastasis to the liver, brain, or bones, whereas lung metastasis appeared to have no effect on short- or long-term survival. Female patients had a better long-term prognosis, with no difference between males and females in short-term prognosis. No significant differences in age were observed between the groups. Patients aged 60-69 years had the best short-term prognosis, whereas patients aged 50-59 years had the best long-term prognosis; patients over 80 years old had the worst prognosis for both short- and long-term survival. In terms of the primary tumor site, patients with tumors in the middle lung lobe had the best long-term prognosis, whereas patients with overlapping lung lesions or tumors in the main bronchus had the worst prognosis.
OS of PSM-Adjusted Radiotherapy Patients.
Stratified Results of Multifactorial Cox Regression Analysis of OS in PSM-Adjusted Radiotherapy Patients. A Shows the Relationship Between Chemotherapy and Radiotherapy; B, Liver Metastasis and Radiotherapy; C, Brain Metastasis and Radiotherapy; D, Bone Metastasis and Radiotherapy; E, Lung Metastasis and Radiotherapy; F, Sex and Radiotherapy; G, Age and Radiotherapy; and H, Primary Tumor Location and Radiotherapy.
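The K-M curves in this analysis rest on the product-limit estimator, which can be sketched in plain Python as follows. This illustrates the estimator itself, not the R code used in the study, and the follow-up times and event flags are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times: follow-up time of each patient (e.g., months).
    events: True if death was observed, False if the patient was censored.
    Returns [(time, survival_probability)] at each distinct death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve

# Hypothetical follow-up (months) for one matched group.
months = [2, 4, 4, 7, 10]
died = [True, True, False, True, False]
curve = kaplan_meier(months, died)
```

Running the function separately on each stratum (for example, radiotherapy versus no radiotherapy within one subgroup) yields the curves that a stratified comparison would plot side by side.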

Discussion
SCLC is a neuroendocrine malignancy characterized by rapid progression and poor prognosis. Unlike non-small cell lung cancer, SCLC is a systemic disease classified into limited and extensive stages. At the time of initial diagnosis, approximately two-thirds of patients present with distant metastasis. 11 Based on clinical experience, treatment approaches, and the relevant literature, this study highlights three main points. First, previous studies have not utilized machine learning techniques to establish prognostic models for SCLC; instead, most have focused on nomogram construction. Treatments for ES-SCLC differ significantly from those for limited-stage SCLC; therefore, it is crucial to analyze ES-SCLC separately in prognostic models. Additionally, predicting survival time is a primary concern for patients with ES-SCLC, so more accurate predictive models are required. Recent nomograms for ES-SCLC also lack external data validation, and many studies have had incomplete follow-up periods, often analyzing outcomes only at 1, 2, or 3 years. Second, while numerous studies have focused on surgical treatments for limited-stage lung cancer, 12 few have emphasized radiotherapy as the primary research focus in SCLC. Previous clinical studies have suggested that radiotherapy has minimal value in patients with distant metastases and serves primarily as a palliative treatment for pain relief. The present study challenges this perspective by investigating whether radiotherapy could significantly improve the prognosis of these patients. Finally, the current Chinese Society of Clinical Oncology (CSCO) guidelines for SCLC do not address key factors such as liver metastasis, lung metastasis, sex, age, or primary site of the tumor. 9 To refine the indications for radiotherapy in ES-SCLC, this study employed PSM and stratified K-M survival analysis.
In the present study, we developed the first AI-based prognostic model for patients with ES-SCLC. The model was iteratively tuned to determine the optimal parameters for 1-, 2-, and 3-year survival predictions, ensuring high accuracy for each time point. Among the six machine learning models evaluated, including XGBoost, random forest, KNN, ID3, SVM, and LR, XGBoost and LR demonstrated robust performance, with AUC values >0.7. Although both models had identical AUC values (0.71) for the 3-year predictions, XGBoost outperformed LR for the 1- and 2-year predictions, making it the most accurate model for predicting ES-SCLC survival. The XGBoost model exhibited excellent performance during independent external data validation, highlighting its clinical utility. Several independent prognostic factors were identified, including sex, radiotherapy, chemotherapy, and liver metastasis.
Regarding the general clinical data, our study confirmed that sex is an independent prognostic factor for ES-SCLC patients. This finding aligns with research by Zhong et al., which showed that male patients generally have poorer prognoses and higher mortality than females. 13 This disparity may be attributed to factors such as smoking rates, hormone levels, and treatment tolerance, which may explain why sex ranks highly in the model’s feature importance. Among the metastatic sites, our multivariate Cox regression analysis revealed that liver metastasis was the only independent factor negatively affecting survival, which is consistent with the findings of Wu et al. 14 Given that the liver is an important metabolic and immune organ, its involvement suggests a high tumor burden and rapid systemic progression, which may explain its significant impact on prognosis.
In terms of treatment, chemotherapy is a cornerstone of ES-SCLC management, and its omission significantly worsens prognosis. Approximately 20% of SCLC patients do not receive chemotherapy, 15 and forgoing it markedly reduces survival. 16 Radiotherapy, although historically considered a palliative option for distant metastases, emerged as an independent favorable factor in our analysis. This finding supports Longo et al.’s description of the role of radiotherapy in improving symptoms and controlling disease progression. 17 Despite the low recommendation level (2C) for radiotherapy in the current guidelines, 16 recent studies, including those by Deng et al. and Yuan et al., 18,19 have suggested that appropriate radiotherapy timing and fractionation can significantly extend OS. Han et al. also identified hypofractionated radiotherapy (45 Gy in 30 fractions) as effective in improving the OS of ES-SCLC patients. 18
To determine when radiotherapy improves the prognosis of ES-SCLC patients, we investigated the effects of radiotherapy across patient subgroups. After balancing confounding factors via PSM, we identified three key findings from the K-M curve stratification. First, radiotherapy combined with chemotherapy significantly improved survival in ES-SCLC patients. However, for patients with the potential for local control, individualized radiotherapy regimens should be prioritized over systemic chemotherapy. Second, our study suggests that high-risk patients without bone, liver, or brain metastases may benefit from preventive or consolidative radiotherapy, as suggested by O'Brien et al. and Slotman et al. 20,21 Finally, our results showed that females, patients aged 50-69 years, and those with primary lesions in the middle lung lobe or main bronchus (excluding lung metastasis) were more likely to benefit from radiotherapy.
These findings have several translational implications for clinical practice. We recommend that the CSCO guidelines be updated to include specific recommendations for radiotherapy based on anatomical and demographic factors, with increased strength of recommendation for certain subgroups. Additionally, we propose the development of a “radiotherapy benefit prediction model” within multidisciplinary tumor diagnosis and treatment teams. This model would help clinicians make more informed decisions by integrating patient characteristics such as age, sex, and metastatic status into a visual decision tree. Finally, we suggest implementing a dynamic evaluation mechanism for radiotherapy indications through prospective data collection, allowing continuous optimization of treatment strategies.
Despite these promising findings, this study had several limitations. First, although immunotherapy has become a first-line treatment for ES-SCLC, we could not include this factor because of limitations in the available database and incomplete clinical adoption of immunotherapy in some regions. This omission may have reduced the relevance of the model to current treatment outcomes. Second, the SEER database lacks detailed treatment information (e.g., drug type, dosage, and treatment cycle), which limits our ability to apply deconvolution tools (such as CIBERSORTx and MuSiC) to analyze treatment intensities and combinations. Moreover, inherent biases in retrospective studies, such as selection bias and missing data, may have affected the accuracy of the results. Finally, this study used a Chinese cohort as the external validation set, which only partially completes the validation. To improve the generalizability of the results, further validation is needed through retrospective or prospective studies across more centers and more diverse ethnic groups.
Conclusion
The machine learning prognostic model for ES-SCLC developed in this study demonstrated high reproducibility. We identified key prognostic factors for patients with ES-SCLC and strongly recommend radiotherapy for those meeting at least one of the following criteria: predicted survival of approximately 1 year; absence of liver, brain, or bone metastases; female sex; age 50-69 years; or primary tumors located in the middle lobe of the lung or main airways. This study enhances the indications for radiotherapy outlined in the ES-SCLC guidelines.
Footnotes
Author Note
All authors of this paper have read and approved the final version submitted.
Acknowledgements
We appreciate the efforts of the SEER tumor registry team in establishing the database.
Ethical Statement
Ethical Approval
The study followed the Declaration of Helsinki and was approved by the Weifang Hospital of Traditional Chinese Medicine (Ethics Approval No. 2025YX391; Approval date: 4/25/2025). Informed consent was obtained from each patient in the expanded group.
Author Contributions
Conceptualization, H.W. and H.Z.; Data curation, L.W.; Formal analysis, R.L.; Funding acquisition, C.S. and J.Z.; Investigation, H.W.; Methodology, Y.Y.; Supervision, J.Z.; Visualization, H.Z.; Writing – original draft, H.W. and H.Z.; Writing – review & editing, C.S. and J.Z.; Final approval of manuscript, all authors. All authors have read and agreed to the published version of the manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Shandong Taishan scholars specially invited expert talent project (tstp20221166 to C.S.) and Shandong Province Natural Science Foundation innovation and development joint fund (ZR2023LZY006 to J.Z.).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
