Abstract
Background
Primary gastric diffuse large B-cell lymphoma (PG-DLBCL) exhibits heterogeneous outcomes, and conventional prognostic systems often fail to capture its unique clinicopathological features. We aimed to develop a robust, interpretable prognostic model to improve risk stratification and guide individualized management.
Methods
Data from 3773 PG-DLBCL patients (2000–2021) were extracted from the SEER database. Four complementary feature selection strategies—LASSO, Boruta, backward stepwise elimination, and best subset regression (BSR)—were employed to identify stable prognostic variables. Four machine learning (ML) algorithms (logistic regression, support vector machine, k-nearest neighbor, and XGBoost) were trained using these selected variables. Model performance was evaluated through discrimination (AUC), calibration, decision curve analysis, and internal validation. Shapley Additive Explanations (SHAP) were applied for interpretability, and patients were stratified into high- and low-risk groups.
Results
Age, stage, chemotherapy, marital status, and income emerged as key prognostic determinants. The XGBoost model based on BSR-selected predictors achieved the highest performance with good calibration and net clinical benefit. SHAP analysis demonstrated that older age, advanced stage and absence of chemotherapy increased predicted risk, whereas marital status and higher income were protective. Risk stratification effectively distinguished survival outcomes in training and testing cohorts (p < 0.001). A web-based tool was developed for individualized risk assessment.
Conclusions
We established an interpretable, high-performing ML-based prognostic model for PG-DLBCL that integrates clinical, treatment, and socioeconomic factors. This tool enables precise risk stratification, supports individualized therapeutic decisions, and provides a methodological framework for prognostic modeling in rare extranodal lymphomas.
Introduction
Primary gastric diffuse large B-cell lymphoma (PG-DLBCL) is the most frequent subtype of primary gastric lymphoma and constitutes the majority of extranodal diffuse large B-cell lymphoma cases.1–3 Although the introduction of rituximab-based immunochemotherapy has significantly improved survival outcomes, the prognosis of PG-DLBCL remains highly heterogeneous due to variations in age, disease stage, tumor burden, and molecular characteristics. Conventional prognostic systems, such as the International Prognostic Index (IPI) and Ann Arbor stage, were developed for nodal DLBCL and may not adequately capture the unique clinicopathological features of gastric involvement.2,4–8 Consequently, a substantial proportion of patients are either under- or overestimated in terms of risk, which may lead to suboptimal treatment decisions.
Given these challenges, the development of a robust prognostic model specifically tailored for PG-DLBCL is of critical importance. Such a model would enable accurate survival prediction, refine risk stratification, and facilitate individualized therapeutic strategies. Moreover, it would provide clinicians with an evidence-based tool to identify high-risk patients who may benefit from more intensive treatment and close surveillance, while sparing low-risk patients from unnecessary toxicity. In the era of precision medicine, constructing reliable and validated prognostic models is not only essential for guiding clinical management but also for informing clinical trial design and advancing translational research in PG-DLBCL.
In recent years, population-based cancer registries such as the Surveillance, Epidemiology, and End Results (SEER) database have provided unprecedented opportunities to explore prognostic determinants across large and diverse cohorts. However, conventional regression-based approaches applied to SEER data often struggle with high dimensionality, multicollinearity, and complex interactions. This motivates the use of advanced statistical and machine learning (ML) approaches capable of handling multifaceted prognostic landscapes.9 ML methods have emerged as powerful tools for survival prediction and risk modeling in oncology. By leveraging non-linear modeling capacity, automatic feature selection, and ensemble learning strategies, ML algorithms have demonstrated superiority over traditional methods in predictive performance.9–13 Recent studies, such as the work by Ismayilov et al. on predicting CNS involvement, demonstrate the ability of ML to handle complex clinical scenarios.14 Nonetheless, most prior studies on DLBCL prognosis have focused either on nodal disease or on single-algorithm applications,7,8,15–17 with few efforts devoted to PG-DLBCL specifically. Additionally, systematic integration of multiple variable selection strategies with multi-model ML frameworks remains scarce, potentially limiting robustness and generalizability.
To bridge these gaps, the present study aimed to establish a comprehensive prognostic evaluation and risk stratification tool for patients with PG-DLBCL using population-based SEER data. By integrating multi-strategy variable selection methods with diverse ML algorithms, we endeavor to (i) identify key clinical and pathological determinants of PG-DLBCL survival with high robustness, (ii) construct and compare multiple predictive models to maximize prognostic accuracy, and (iii) develop a clinically applicable risk stratification framework to guide individualized patient management. This integrative approach is expected to offer a novel methodological paradigm for prognostic modeling in rare extranodal lymphomas and ultimately facilitate precision oncology decision-making in PG-DLBCL.
Methods
Data source
Patient information for PG-DLBCL diagnosed between 2000 and 2021 was retrieved from the SEER program of the National Cancer Institute through SEER*Stat software. The SEER program is a publicly available, population-based cancer registry with rigorous quality control, which ensures large-scale, representative, and long-term survival data. Because the dataset is de-identified and freely accessible, separate institutional ethical approval and informed consent were not required.
A total of 3773 eligible cases were identified according to tumor site codes C16.0-C16.9 and the ICD-O-3 histological code 9680/3. Clinical and demographic variables were extracted, including age at diagnosis, sex, race, marital status, socioeconomic indicators (county-level median household income and rural/urban status), tumor site, Ann Arbor stage, time from diagnosis to initial treatment, treatment modalities [surgery, chemotherapy (CT), radiotherapy (RT)], and survival outcomes. Patients were excluded if they (1) did not have a pathologically confirmed diagnosis, (2) were not diagnosed with PG-DLBCL as their first primary malignancy, (3) had incomplete or unavailable follow-up information, or (4) lacked sufficient data on essential clinicopathological characteristics.
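The inclusion and exclusion rules above can be expressed as a simple record filter. The following is a minimal sketch only: the field names (`site`, `histology`, `path_confirmed`, `first_primary`, `followup_months`, `stage`) and the example records are hypothetical and do not correspond to actual SEER*Stat export columns or to study data.

```python
# Hypothetical record layout; real SEER*Stat exports use different column names.
GASTRIC_SITES = {f"C16.{i}" for i in range(10)}  # C16.0 through C16.9
DLBCL_HISTOLOGY = "9680/3"                        # ICD-O-3 code stated in the text


def is_eligible(rec):
    """Return True if a record satisfies the stated inclusion/exclusion rules."""
    return (
        rec["site"] in GASTRIC_SITES
        and rec["histology"] == DLBCL_HISTOLOGY
        and rec["path_confirmed"]                # exclusion (1)
        and rec["first_primary"]                 # exclusion (2)
        and rec["followup_months"] is not None   # exclusion (3)
        and rec["stage"] is not None             # exclusion (4), one example field
    )


# Made-up example records for illustration only.
records = [
    {"site": "C16.2", "histology": "9680/3", "path_confirmed": True,
     "first_primary": True, "followup_months": 34, "stage": "II"},
    {"site": "C18.0", "histology": "9680/3", "path_confirmed": True,
     "first_primary": True, "followup_months": 12, "stage": "I"},
    {"site": "C16.9", "histology": "9680/3", "path_confirmed": True,
     "first_primary": False, "followup_months": 50, "stage": "IV"},
    {"site": "C16.0", "histology": "9680/3", "path_confirmed": True,
     "first_primary": True, "followup_months": None, "stage": "I"},
]

eligible = [r for r in records if is_eligible(r)]
print(len(eligible), "of", len(records), "records retained")
```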
Sensitivity analyses were performed to explore the optimal partitioning strategy for the derivation cohort. Across all tested split ratios, the area under the curve (AUC) consistently exceeded 0.75 (Supplemental Figure 1), indicating stable generalization capacity of the models. Based on this robustness and to balance adequate sample size for model training with sufficient cases for internal validation, the cohort was ultimately divided into a training set and a testing set in a 7:3 ratio.
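A 7:3 partition of this kind can be sketched as follows. The text does not state whether the split was stratified by outcome or which random seed was used, so both the stratification and the seed value here are assumptions made for illustration.

```python
import random


def stratified_split(labels, train_frac=0.7, seed=42):
    """Split indices into train/test, preserving the outcome ratio per class.

    Stratification and the seed are illustrative assumptions, not details
    reported in the study.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return sorted(train_idx), sorted(test_idx)


# Made-up outcome labels: 30 events and 70 non-events.
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Repeating the call with other values of `train_frac` is one way to reproduce the kind of split-ratio sensitivity check described above.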
Variable screening
To ensure that the prognostic model was constructed on robust and informative predictors, we applied four complementary feature selection strategies. The Least Absolute Shrinkage and Selection Operator (LASSO), under the λ1se penalty, reduced dimensionality and retained six covariates with the strongest predictive value. The Boruta algorithm, through iterative permutation and ranking, identified five stable variables as essential features. Backward stepwise elimination yielded seven predictors, while best subset regression (BSR), guided by the adjusted R² criterion, converged on a parsimonious set of five predictors. This multi-angle selection process enhanced interpretability while minimizing the risk of model overfitting.13
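The λ1se rule used for LASSO chooses the largest penalty whose cross-validated error lies within one standard error of the minimum, which favors sparser models. A minimal sketch of that rule is shown below; the penalty grid and error values are made up for illustration and are not results from this study.

```python
def lambda_1se(lambdas, cv_mean, cv_se):
    """Return the largest lambda whose CV error is within 1 SE of the minimum."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[best] + cv_se[best]
    candidates = [lam for lam, err in zip(lambdas, cv_mean) if err <= threshold]
    return max(candidates)


# Illustrative (made-up) cross-validation summary.
lambdas = [0.001, 0.005, 0.01, 0.05, 0.1]
cv_mean = [0.525, 0.510, 0.512, 0.518, 0.560]
cv_se = [0.010, 0.010, 0.010, 0.011, 0.012]

chosen = lambda_1se(lambdas, cv_mean, cv_se)
print(chosen)
```

In this toy grid the error minimum sits at 0.005, but the rule selects a larger penalty because its error is statistically indistinguishable from the minimum.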
Model development and validation
To develop a prognostic tool, we applied four ML frameworks: logistic regression (LR), support vector machine (SVM), k-nearest neighbor (KNN), and extreme gradient boosting (XGBoost). Each algorithm was trained on the variable set produced by each of the four selection methods (LASSO, Boruta, Backward, and BSR). Model performance was comparatively evaluated in both training and testing cohorts using multiple discrimination and classification metrics, including the AUC, sensitivity, specificity, balanced accuracy, and recall.
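The threshold-based metrics listed above all derive from the confusion matrix. The sketch below computes them for binary labels; note that recall is by definition the same quantity as sensitivity. The example labels and predictions are made up.

```python
def classification_metrics(y_true, y_pred):
    """Compute sensitivity (recall), specificity, and balanced accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,  # identical to recall
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Made-up labels and predictions for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```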
Model performance was assessed through repeated five-fold cross-validation, and the discriminatory ability was quantified by calculating the AUC in both the training and testing cohorts. Reliability of probability estimates was examined via calibration analysis, and the net clinical benefit of the model was quantified through decision curve analysis (DCA).
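For reference, the AUC equals the probability that a randomly chosen patient with the event receives a higher predicted score than a randomly chosen patient without it. The sketch below computes the AUC from that definition and builds shuffled five-fold index sets. It is an illustration with made-up scores; the fold construction and seed are assumptions, since the exact cross-validation procedure is not described in detail.

```python
import random


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def kfold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Made-up labels and predicted scores.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
example_auc = auc(y_true, scores)
folds = kfold_indices(23, k=5)
print(round(example_auc, 3), [len(f) for f in folds])
```

Averaging the per-fold AUC, and repeating with different seeds, gives a mean and standard deviation of the kind reported for cross-validated performance.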
Model interpretation and prognostic evaluation
Shapley Additive Explanations (SHAP) analysis was employed to quantify the relative contribution of each predictor, thereby improving the interpretability of the ML model and translating complex computational outputs into clinically meaningful insights.
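To make the idea behind SHAP concrete, the sketch below computes exact Shapley values for a single patient by enumerating all feature coalitions. It is a conceptual illustration under stated assumptions: `toy_risk` is a hypothetical scoring function rather than the fitted XGBoost model, absent features are replaced by a single reference patient (`baseline`) rather than averaged over background data, and practical SHAP implementations for tree models use faster algorithms than brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for instance x relative to a baseline instance."""
    n = len(x)

    def value(subset):
        # Features in the coalition take the patient's value; others the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                coalition = set(s)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def toy_risk(z):
    """Hypothetical risk score with one interaction term (not the study model)."""
    age, no_chemo, advanced_stage = z
    return 0.02 * age + 0.3 * no_chemo + 0.2 * advanced_stage \
        + 0.1 * no_chemo * advanced_stage


x = [75, 1, 1]          # made-up patient: age 75, no chemotherapy, advanced stage
baseline = [60, 0, 0]   # made-up reference patient
phi = shapley_values(toy_risk, x, baseline)
print([round(v, 3) for v in phi])
```

The contributions sum to the difference between the patient's score and the baseline score, and the interaction term is shared equally between the two features involved.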
To further facilitate risk assessment, the optimal cutoff value for model-derived predictions was determined in the training cohort, enabling stratification of patients into high- and low-risk subgroups. The prognostic discriminatory capacity of this stratification was rigorously evaluated for overall survival (OS) using Kaplan–Meier survival curves and log-rank tests in both the training and testing cohorts, confirming the stability and generalizability of the model. To enhance clinical applicability, we additionally developed an interactive web-based R Shiny application, allowing clinicians to input individual patient characteristics and immediately obtain personalized risk estimates together with visual interpretation of feature contributions, thus bridging the gap between advanced ML methodology and practical decision-making in PG-DLBCL management.
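The criterion used to define the optimal cutoff is not specified in the text. One common choice, shown here purely as an assumption, is the threshold that maximizes Youden's J (sensitivity + specificity − 1) in the training cohort. The labels and predicted risks below are made up.

```python
def youden_cutoff(y_true, scores):
    """Return (cutoff, J) maximizing Youden's J; score >= cutoff means high risk.

    Youden's J is an assumed criterion for illustration; the study does not
    state how its optimal cutoff was derived.
    """
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= cut)
        fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < cut)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < cut)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= cut)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


# Made-up training labels and predicted risks.
y_true = [0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
cutoff, j_stat = youden_cutoff(y_true, scores)
risk_group = ["high" if s >= cutoff else "low" for s in scores]
print(cutoff, round(j_stat, 2), risk_group)
```

Fixing the cutoff on the training cohort and applying it unchanged to the testing cohort keeps the held-out evaluation free of threshold tuning.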
Statistical analyses were conducted in R with standard packages, and significance was defined as p < 0.05.
Results
Patient characteristics
Baseline characteristics of patients in the total, training and testing cohorts.
Variable screening
Feature selection was performed using four independent strategies, namely LASSO, Boruta, backward stepwise regression, and BSR. The LASSO approach, under the λ1se penalty, identified six key predictors: age, stage, RT, CT, marital status, and income (Figure 1(a) and (b)). The Boruta algorithm, through iterative permutation and ranking, converged on five stable variables, including age, stage, CT, time from diagnosis to treatment, and marital status (Figure 1(c)). Backward stepwise regression yielded a broader set of seven predictors, namely age, race, stage, CT, time from diagnosis to treatment, marital status, and income (Table 2). In contrast, BSR, guided by the adjusted R² criterion, retained a parsimonious panel of five predictors: age, stage, CT, marital status, and income (Figure 1(d)). Collectively, these results revealed both consistency and heterogeneity across different methods. Age, stage, CT, and marital status were repeatedly selected, highlighting their robust prognostic value, while variables such as race, RT, income, and time from diagnosis to treatment appeared in a method-specific manner, suggesting potential supplementary prognostic information (Table 3).
Comparison of feature selection methods.
Backward stepwise regression results: univariable, multivariable, and final models.
Results of feature selection. RT: radiotherapy; CT: chemotherapy.
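The agreement between methods can be tallied directly from the variable sets reported above. The following sketch counts how many of the four methods selected each variable; the variable names are shortened labels chosen for this illustration.

```python
from collections import Counter

# Variable sets as reported in the text for each selection method.
selected = {
    "LASSO": {"age", "stage", "RT", "CT", "marital status", "income"},
    "Boruta": {"age", "stage", "CT", "time to treatment", "marital status"},
    "Backward": {"age", "race", "stage", "CT", "time to treatment",
                 "marital status", "income"},
    "BSR": {"age", "stage", "CT", "marital status", "income"},
}

# Count the number of methods that selected each variable.
votes = Counter(var for variables in selected.values() for var in variables)

# Variables chosen by every method form the consensus core.
core = sorted(var for var, n in votes.items() if n == len(selected))
method_specific = sorted(var for var, n in votes.items() if n < len(selected))

print("core:", core)
print("method-specific:", {var: votes[var] for var in method_specific})
```

The tally shows age, stage, CT, and marital status selected by all four methods, income by three, time from diagnosis to treatment by two, and race and RT by one each.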
Model development and validation
Performance of LR, SVM, XGBoost, and KNN models with different feature selection methods.
LR: logistic regression; SVM: support vector machines; XGBoost: extreme gradient boosting; KNN: k-nearest neighbors.
The predictive performance of the final model was comprehensively assessed across both the training and testing cohorts. In 5-fold cross-validation, the mean AUCs were 0.770 ± 0.017 in the training cohort (Figure 2(a)) and 0.772 ± 0.041 in the testing cohort (Figure 2(b)), demonstrating stable discriminative ability. Calibration analysis showed good agreement between predicted and observed probabilities in both cohorts, with C-indices of 0.784 and 0.779, respectively (Figure 2(c) and (d)). The calibration curves closely followed the ideal 45° line, indicating satisfactory model reliability. Furthermore, DCA revealed that the model provided a higher net clinical benefit across a wide range of threshold probabilities compared with default strategies (treat-all or treat-none) in both the training and testing cohorts (Figure 2(e) and (f)). Together, these findings confirmed that the proposed model achieved robust discrimination, good calibration, and favorable clinical utility.
Performance evaluation and clinical utility of the predictive model.
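Decision curve analysis compares strategies by net benefit, defined at threshold probability p_t as TP/n − (FP/n) × p_t/(1 − p_t). The sketch below evaluates this quantity for a model and for the treat-all strategy (treat-none always has a net benefit of zero). The outcomes and predicted probabilities are made up for illustration.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit when every patient is treated regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Made-up outcomes and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1]

nb_model = net_benefit(y_true, probs, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
print(round(nb_model, 3), round(nb_all, 3))
```

Evaluating both functions over a grid of thresholds and plotting the results yields a decision curve.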
Model interpretation and prognostic evaluation
To enhance model interpretability, SHAP analysis was applied to quantify the contribution of each predictor. As shown in Figure 3(a), age had the highest impact on model output (SHAP value = 0.1655), followed by chemotherapy (0.0588), tumor stage (0.0487), marital status (0.0351), and income (0.0284), highlighting their relative importance in prognostic prediction. The SHAP partial dependence profiles illustrated both the direction and magnitude of each feature’s effect, highlighting nonlinear patterns and potential interactions (Figure 3(b)). Furthermore, individualized SHAP analysis illustrated how specific variables influenced prediction outcomes at the patient level (Figure 3(c)). For the representative patient presented, younger age reduced the predicted risk, whereas the absence of chemotherapy was the largest contributor to increased risk. Tumor stage, marital status, and income also increased risk for this patient. Age therefore acted as a protective factor and the remaining features as risk enhancers, giving an interpretable explanation of the model’s prediction for this patient.
SHAP analysis.
The final model was applied to predict outcomes in the training cohort, and patients were stratified into high- and low-risk groups using the optimal cutoff value. Kaplan–Meier survival analysis demonstrated that the two risk groups exhibited significantly different prognoses in both the training and testing cohorts (p < 0.001, Figure 4(a) and (b)), confirming the model’s strong discriminative ability. These results indicated that the model can effectively distinguish patients with different risk profiles, supporting its potential utility for individualized prognostic assessment and clinical decision-making. A web-based prediction tool was further developed to facilitate real-time, user-friendly access to the model’s outputs (https://seerr.shinyapps.io/appDLBCL/, Figure 5). This platform allows clinicians and researchers to input individual patient characteristics and obtain personalized risk predictions, thereby translating complex ML results into actionable clinical insights.
Kaplan–Meier survival analysis for risk stratification. Kaplan–Meier survival curves for patients stratified by the optimal risk cutoff in the training (a) and testing (b) cohorts. Patients classified as high-risk show significantly poorer survival compared with low-risk patients in both cohorts, demonstrating that the model effectively discriminates patient prognosis and provides robust risk stratification.
Development of the web-based predictive tool. Schematic overview of the web tool interface, illustrating the input of patient parameters and real-time calculation of individualized risk scores.
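The Kaplan–Meier curves compared between risk groups rest on the product-limit estimator, which multiplies the conditional survival fractions at each observed event time. A minimal sketch is given below, with made-up follow-up times and event indicators; it computes the curve for a single group and omits the log-rank test.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, S(time)) at each event time.

    events: 1 = death observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        at_this_time = [e for tt, e in data if tt == t]
        deaths = sum(at_this_time)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after time t.
        at_risk -= len(at_this_time)
        i += len(at_this_time)
    return curve


# Made-up follow-up times (months) and event indicators for one risk group.
times = [2, 3, 3, 5, 8, 10]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Running the estimator separately on the high- and low-risk groups gives the two curves that a log-rank test would then compare.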

Discussion
In this study, we developed and validated an ML-based prognostic model specifically tailored for patients with PG-DLBCL, using a large population-based cohort from the SEER database. By integrating multiple complementary feature selection strategies with diverse ML algorithms, we identified a robust and stable set of prognostic variables and demonstrated that an ensemble-based XGBoost model outperformed conventional statistical approaches. To our knowledge, this represents one of the first systematic attempts to establish a comprehensive prognostic framework for PG-DLBCL through the combined application of multi-strategy variable selection and multi-model ML.
Unlike prior studies that focused primarily on the IPI or Ann Arbor staging, we systematically assessed a broad spectrum of demographic, clinical, and treatment-related variables in a real-world setting. This approach enabled us to capture heterogeneity often overlooked by traditional indices, such as socioeconomic and marital status, both of which consistently emerged as independent predictors of survival. Their inclusion underscores the multifactorial nature of PG-DLBCL outcomes and extends the scope of prognostic assessment. To minimize model instability caused by collinearity and noise, we employed a multi-angle feature selection pipeline combining LASSO, Boruta, backward stepwise elimination, and BSR. Core prognostic factors—age, stage, chemotherapy, and marital status—were consistently identified across all methods, while features such as income and radiotherapy appeared more context-dependent, potentially reflecting nuanced modifiers of outcome. This layered approach enhanced both reproducibility and interpretability, distinguishing our framework from single-method models commonly reported in oncology research.
Among the identified predictors, age was reaffirmed as a dominant factor, consistent with its well-established role in DLBCL prognosis.2,4,18 Older patients often experience poorer outcomes due to comorbidities, reduced treatment tolerance, and less frequent use of intensive immunochemotherapy. Similarly, chemotherapy—particularly rituximab-based regimens—remains a cornerstone of therapy and was strongly associated with improved survival in our analysis.1,4,5,18–20 Beyond biological and treatment-related factors, social and economic variables also shaped outcomes: married patients had superior survival, likely reflecting greater psychosocial support, while higher household income correlated with improved prognosis, potentially due to enhanced access to care.21–24 These findings highlighted the importance of integrating non-biological factors into prognostic models and addressing disparities in PG-DLBCL management. Ultimately, the XGBoost model built on BSR-selected variables yielded the best predictive performance and was adopted as the final framework.
Beyond predictive accuracy, our model provided transparency through SHAP analysis. One of the major critiques of ML in clinical oncology is its “black-box” nature.25 By quantifying the directionality and magnitude of each predictor’s contribution at both cohort and individual levels, our framework translated complex computational outputs into clinically meaningful explanations. This interpretability is essential for clinician trust and practical adoption, enabling oncologists to understand the rationale for risk classification and to incorporate these insights into treatment planning.25,26
Clinically, the implications of this model are considerable. The clear separation of survival curves between high- and low-risk groups highlights its utility for refining risk stratification, guiding treatment intensity, and identifying patients who may benefit from closer monitoring or clinical trial enrollment. Moreover, the development of a web-based tool ensured real-time accessibility, bridging the gap between computational modeling and bedside decision-making. Such accessibility enhances the likelihood of clinical integration and may ultimately promote more equitable care for PG-DLBCL patients.
Furthermore, our model’s performance should be viewed in the context of established clinical tools like the NCCN-IPI.27 As demonstrated by Zhou et al.,27 the NCCN-IPI significantly improved the identification of high-risk patients by refining age and LDH categorizations. Our findings aligned with their observation that clinical outcomes in DLBCL are driven by a complex interplay of patient and disease characteristics. However, while the NCCN-IPI is designed for rapid bedside risk grouping, our ML-based approach aimed to provide a more tailored prognostic score. By integrating a wider array of variables available in the SEER database and utilizing XGBoost’s ability to handle non-linear interactions, we offer a supplementary tool that may capture nuances, such as socioeconomic factors or specific treatment combinations, that traditional scoring systems might not fully account for. Consequently, as highlighted by Ismayilov et al., ML represents a powerful evolution in oncology, helping to address complex clinical challenges that traditional linear models are less equipped to resolve.14
Nevertheless, several limitations merit attention. While SEER provides a large-scale and representative dataset, key biological and molecular parameters are not available. Incorporating such molecular signatures would likely enhance prognostic precision and align the model with the evolving biological understanding of DLBCL heterogeneity.28,29 A notable limitation of this study is that the SEER database records chemotherapy as a binary “Yes/No” variable. It lacks specific details regarding the chemotherapy regimen, dosage, or number of cycles, which are critical for survival analysis. From a gastroenterological perspective, the pathogenesis of gastric lymphoma is often linked to chronic H. pylori infection, which can trigger the development of MALT lymphoma and its subsequent transformation into more aggressive forms, such as DLBCL.30 While our model focuses on survival outcomes based on the characteristics at the time of diagnosis, the potential for MALT transformation remains a key clinical consideration. Future studies with access to primary endoscopic and microbiological data are needed to clarify how H. pylori eradication and prior MALT history might influence the performance of ML-based prognostic tools. Furthermore, although internal validation demonstrated robust calibration and discrimination, external validation using independent, multi-institutional cohorts will be necessary to confirm generalizability across diverse patient populations and clinical settings. Despite these limitations, our work establishes a methodological paradigm for rare extranodal lymphomas: the integration of multi-strategy variable selection with ML not only improved predictive performance but also yielded interpretable, clinically actionable tools.
Moving forward, such integrative approaches may form the foundation for dynamic prognostic systems that incorporate molecular data, treatment response, and longitudinal follow-up, thereby advancing the promise of precision oncology in PG-DLBCL.
Conclusion
In summary, this study presents a comprehensive, ML-driven prognostic framework specifically for PG-DLBCL. By integrating multi-strategy feature selection with diverse ML algorithms, we identified robust predictors of survival outcomes: age, stage, chemotherapy, marital status, and income. The XGBoost model demonstrated strong discrimination, reliable calibration, and clear clinical utility, while SHAP analysis provided interpretable insights at both cohort and individual levels. Risk stratification effectively separated high- and low-risk patients, highlighting its potential to guide individualized treatment, surveillance, and clinical trial enrollment. This approach offers a novel paradigm for prognostic modeling in rare extranodal lymphomas and lays the foundation for precision oncology in PG-DLBCL.
Supplemental material
Supplemental material for Multi-strategy feature selection and multi-model machine learning for prognostic prediction in primary gastric diffuse large B-cell lymphoma by Jingjie Lin, Hanlei Wang, Huirong Lin, Chaowei Xu in DIGITAL HEALTH
Footnotes
Acknowledgments
The authors utilized a large language model (LLM) solely for grammatical refinement and linguistic polishing to improve the readability of the manuscript. All scientific content, data analysis, and interpretations were performed by the authors, and the final manuscript was thoroughly reviewed and approved by all contributors.
Author contributions
Jingjie Lin: Conceptualization, Data curation, Methodology, Formal analysis, Writing—original draft preparation. Hanlei Wang: Data curation, Software, Validation, Writing—review & editing. Huirong Lin: Investigation, Writing—review & editing. Chaowei Xu: Conceptualization, Supervision, Methodology, Writing—review & editing, and Corresponding author responsibilities.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets used during the current study are available from the corresponding author upon reasonable request.
Supplemental material
Supplemental material for this article is available online.
References
