Multi-strategy feature selection and multi-model machine learning for prognostic prediction in primary gastric diffuse large B-cell lymphoma

Abstract

Background

Primary gastric diffuse large B-cell lymphoma (PG-DLBCL) exhibits heterogeneous outcomes, and conventional prognostic systems often fail to capture its unique clinicopathological features. We aimed to develop a robust, interpretable prognostic model to improve risk stratification and guide individualized management.

Methods

Data from 3773 PG-DLBCL patients (2000–2021) were extracted from the SEER database. Four complementary feature selection strategies—LASSO, Boruta, backward stepwise elimination, and best subset regression (BSR)—were employed to identify stable prognostic variables. Four machine learning (ML) algorithms (logistic regression, support vector machine, k-nearest neighbor, and XGBoost) were trained using these selected variables. Model performance was evaluated through discrimination (AUC), calibration, decision curve analysis, and internal validation. Shapley Additive Explanations (SHAP) were applied for interpretability, and patients were stratified into high- and low-risk groups.

Results

Age, stage, chemotherapy, marital status, and income emerged as key prognostic determinants. The XGBoost model based on BSR-selected predictors achieved the highest performance with good calibration and net clinical benefit. SHAP analysis demonstrated that older age, advanced stage, and absence of chemotherapy increased predicted risk, whereas being married and higher income were protective. Risk stratification effectively distinguished survival outcomes in training and testing cohorts (p < 0.001). A web-based tool was developed for individualized risk assessment.

Conclusions

We established an interpretable, high-performing ML-based prognostic model for PG-DLBCL that integrates clinical, treatment, and socioeconomic factors. This tool enables precise risk stratification, supports individualized therapeutic decisions, and provides a methodological framework for prognostic modeling in rare extranodal lymphomas.

Keywords

Primary gastric diffuse large B-cell lymphoma, SEER, machine learning, SHAP, prediction

Introduction

Primary gastric diffuse large B-cell lymphoma (PG-DLBCL) is the most frequent subtype of primary gastric lymphoma and constitutes the majority of extranodal diffuse large B-cell lymphoma cases.^1–3 Although the introduction of rituximab-based immunochemotherapy has significantly improved survival outcomes, the prognosis of PG-DLBCL remains highly heterogeneous due to variations in age, disease stage, tumor burden, and molecular characteristics. Conventional prognostic systems, such as the International Prognostic Index (IPI) and Ann Arbor stage, were developed for nodal DLBCL and may not adequately capture the unique clinicopathological features of gastric involvement.^2,4–8 Consequently, risk is under- or overestimated in a substantial proportion of patients, which may lead to suboptimal treatment decisions.

Given these challenges, the development of a robust prognostic model specifically tailored for PG-DLBCL is of critical importance. Such a model would enable accurate survival prediction, refine risk stratification, and facilitate individualized therapeutic strategies. Moreover, it would provide clinicians with an evidence-based tool to identify high-risk patients who may benefit from more intensive treatment and close surveillance, while sparing low-risk patients from unnecessary toxicity. In the era of precision medicine, constructing reliable and validated prognostic models is not only essential for guiding clinical management but also for informing clinical trial design and advancing translational research in PG-DLBCL.

In recent years, population-based cancer registries such as the Surveillance, Epidemiology, and End Results (SEER) database have provided unprecedented opportunities to explore prognostic determinants across large and diverse cohorts. However, conventional regression-based approaches applied to SEER data often struggle with high-dimensionality, multicollinearity, and the presence of complex interactions. This raises the necessity for advanced statistical and machine learning (ML) approaches capable of handling multifaceted prognostic landscapes.⁹ ML methods have emerged as powerful tools for survival prediction and risk modeling in oncology. By leveraging non-linear modeling capacity, automatic feature selection, and ensemble learning strategies, ML algorithms have demonstrated superiority over traditional methods in predictive performance.^9–13 Recent studies, such as the work by Ismayilov et al. on predicting CNS involvement, demonstrate the ability of ML to handle complex clinical scenarios.¹⁴ Nonetheless, most prior studies on DLBCL prognosis have focused either on nodal disease or single-algorithm applications,^7,8,15–17 with few efforts devoted to PG-DLBCL specifically. Additionally, systematic integration of multiple variable selection strategies with multi-model ML frameworks remains scarce, potentially limiting robustness and generalizability.

To bridge these gaps, the present study aimed to establish a comprehensive prognostic evaluation and risk stratification tool for patients with PG-DLBCL using population-based SEER data. By integrating multi-strategy variable selection methods with diverse ML algorithms, we endeavor to (i) identify key clinical and pathological determinants of PG-DLBCL survival with high robustness, (ii) construct and compare multiple predictive models to maximize prognostic accuracy, and (iii) develop a clinically applicable risk stratification framework to guide individualized patient management. This integrative approach is expected to offer a novel methodological paradigm for prognostic modeling in rare extranodal lymphomas and ultimately facilitate precision oncology decision-making in PG-DLBCL.

Methods

Data source

Patient information for PG-DLBCL diagnosed between 2000 and 2021 was retrieved from the SEER program of the National Cancer Institute through SEER*Stat software. The SEER program is a publicly available, population-based cancer registry with rigorous quality control, which ensures large-scale, representative, and long-term survival data. Because the dataset is de-identified and freely accessible, separate institutional ethical approval and informed consent were not required.

A total of 3773 eligible cases were identified according to tumor site codes C16.0-C16.9 and the ICD-O-3 histological code 9680/3. Clinical and demographic variables were extracted, including age at diagnosis, sex, race, marital status, socioeconomic indicators (county-level median household income and rural/urban status), tumor site, Ann Arbor stage, time from diagnosis to initial treatment, treatment modalities [surgery, chemotherapy (CT), radiotherapy (RT)], and survival outcomes. Patients were excluded if they (1) did not have a pathologically confirmed diagnosis, (2) were not diagnosed with PG-DLBCL as their first primary malignancy, (3) had incomplete or unavailable follow-up information, or (4) lacked sufficient data on essential clinicopathological characteristics.

Sensitivity analyses were performed to explore the optimal partitioning strategy for the derivation cohort. Across all tested split ratios, the area under the curve (AUC) consistently exceeded 0.75 (Supplemental Figure 1), indicating stable generalization capacity of the models. Based on this robustness and to balance adequate sample size for model training with sufficient cases for internal validation, the cohort was ultimately divided into a training set and a testing set in a 7:3 ratio.
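
The 7:3 partition described above can be illustrated with a minimal sketch. It assumes a simple random (non-stratified) split; the seed, the function name `split_indices`, and the use of upward rounding are illustrative assumptions, chosen only because rounding up reproduces the reported cohort sizes. The authors' actual partitioning code is not given in the text.

```python
import math
import random


def split_indices(n_patients, train_fraction=0.7, seed=2024):
    """Randomly partition patient indices into training and testing sets."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)  # reproducible shuffle (seed is arbitrary)
    # Rounding up is an assumption; it yields 2642 training cases for n = 3773.
    n_train = math.ceil(train_fraction * n_patients)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(3773)
print(len(train_idx), len(test_idx))
```

Repeating the same call with other values of `train_fraction` would mirror the sensitivity analysis over split ratios.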

Variable screening

To ensure that the prognostic model was constructed on robust and informative predictors, we applied four complementary feature selection strategies. The Least Absolute Shrinkage and Selection Operator (LASSO), at the λ1se penalty (the largest λ whose cross-validated error lies within one standard error of the minimum), reduced dimensionality and retained six covariates with the strongest predictive value. The Boruta algorithm, through iterative permutation and ranking, identified five stable variables as essential features. Backward stepwise elimination yielded seven predictors, while best subset regression (BSR), guided by the adjusted R² criterion, converged on a parsimonious set of five predictors. This multi-angle selection process enhanced interpretability while minimizing the risk of model overfitting.¹³

Model development and validation

To develop a prognostic tool, we applied four ML frameworks: logistic regression (LR), support vector machine (SVM), k-nearest neighbor (KNN), and extreme gradient boosting (XGBoost). Each algorithm was trained on the variable set produced by each of the four selection methods (LASSO, Boruta, backward elimination, and BSR), yielding 16 candidate models. Model performance was compared in both the training and testing cohorts using multiple discrimination and classification metrics, including the AUC, sensitivity, specificity, balanced accuracy, and F1 score.

Model performance was assessed through repeated five-fold cross-validation, and the discriminatory ability was quantified by calculating the AUC in both the training and testing cohorts. Reliability of probability estimates was examined via calibration analysis, and the net clinical benefit of the model was quantified through decision curve analysis (DCA).
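
As an illustration of the quantity plotted in a decision curve, the sketch below computes net benefit at a single threshold probability using the standard definition, NB = TP/n - (FP/n) × pt/(1 - pt), and compares it with the treat-all strategy (treat-none has a net benefit of zero). The toy outcomes and predicted risks are hypothetical and are not study data; the function names are illustrative.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of intervening on every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical toy data: 1 = death, 0 = alive.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.2, 0.7, 0.55, 0.3]

model_nb = net_benefit(y_true, y_prob, threshold=0.5)
all_nb = net_benefit_treat_all(y_true, threshold=0.5)
print(model_nb, all_nb)  # 0.375 0.25
```

A decision curve is obtained by evaluating these functions over a grid of thresholds; the model is useful wherever its curve lies above both default strategies.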

Model interpretation and prognostic evaluation

Shapley Additive Explanations (SHAP) analysis was employed to quantify the relative contribution of each predictor, thereby improving the interpretability of the ML model and translating complex computational outputs into clinically meaningful insights.
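
The idea behind SHAP can be shown with an exact Shapley computation on a deliberately small example. The sketch below averages each feature's marginal contribution over all feature orderings; the risk function `toy_risk`, its coefficients, and the feature names are hypothetical and do not represent the study's XGBoost model, for which SHAP values are normally obtained with dedicated tree-based algorithms rather than by enumeration.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: mean marginal contribution over all feature orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for feature in order:
            before = value_fn(frozenset(present))
            present.add(feature)
            after = value_fn(frozenset(present))
            totals[feature] += after - before
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_risk(present):
    """Hypothetical predicted risk given which risk factors are 'switched on'."""
    risk = 0.40  # baseline risk
    if "older_age" in present:
        risk += 0.30
    if "advanced_stage" in present:
        risk += 0.10
    if "older_age" in present and "advanced_stage" in present:
        risk += 0.10  # interaction, split equally between the two features
    if "chemotherapy" in present:
        risk -= 0.20
    return risk


features = ["older_age", "advanced_stage", "chemotherapy"]
phi = shapley_values(toy_risk, features)
print(phi)
```

The contributions sum to the difference between the full prediction and the baseline (0.70 - 0.40), which is the additivity property that makes per-patient explanations possible.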

To further facilitate risk assessment, the optimal cutoff value for model-derived predictions was determined in the training cohort, enabling stratification of patients into high- and low-risk subgroups. The prognostic discriminatory capacity of this stratification was rigorously evaluated for overall survival (OS) using Kaplan–Meier survival curves and log-rank tests in both the training and testing cohorts, confirming the stability and generalizability of the model. To enhance clinical applicability, we additionally developed an interactive web-based R Shiny application, allowing clinicians to input individual patient characteristics and immediately obtain personalized risk estimates together with visual interpretation of feature contributions, thus bridging the gap between advanced ML methodology and practical decision-making in PG-DLBCL management.
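
For readers less familiar with the survival comparison, the following is a minimal sketch of the Kaplan–Meier product-limit estimator for one risk group. The follow-up times and event indicators are hypothetical; in practice the curves and log-rank tests would be produced with an established survival analysis package.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing this follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed  # deaths and censored patients leave the risk set
    return curve


# Hypothetical follow-up (months) for five patients in one risk group.
curve = kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
print(curve)
```

Computing one such curve per risk group and comparing them is what the Kaplan–Meier plots and log-rank tests summarize.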

Statistical analyses were conducted in R with standard packages, and significance was defined as p < 0.05.

Results

Patient characteristics

A total of 3773 patients with PG-DLBCL were identified, including 2642 in the training cohort and 1131 in the testing cohort (Table 1). Most patients (70.9%) were aged 60 years or older, with the 70-79 age group the largest (25.7%). Male patients predominated (56.5%), and most were White (77.2%). Tumors most often arose in the middle stomach (29.3%) or overlapping/unspecified sites (49.7%), while cardia involvement was rare (4.5%). Early-stage disease accounted for 66.6% of cases, and one-third presented with stage III/IV. Only 11.1% underwent surgery and 17.2% received radiotherapy, whereas chemotherapy remained the mainstay (73.4%).

Table 1.

Baseline characteristics of patients in the total, training and testing cohorts.

Characteristic	Level	Total (n = 3773), n (%)	Training cohort (n = 2642), n (%)	Testing cohort (n = 1131), n (%)
Age	<50	504 (13.4)	367 (13.9)	137 (12.1)
	50-59	594 (15.7)	421 (15.9)	173 (15.3)
	60-69	822 (21.8)	569 (21.5)	253 (22.4)
	70-79	969 (25.7)	674 (25.5)	295 (26.1)
	≥80	884 (23.4)	611 (23.1)	273 (24.1)
Sex	Male	2131 (56.5)	1479 (56.0)	652 (57.6)
Sex	Female	1642 (43.5)	1163 (44.0)	479 (42.4)
Race	White	2913 (77.2)	2056 (77.8)	857 (75.8)
	Black	318 (8.4)	220 (8.3)	98 (8.7)
	Others	542 (14.4)	366 (13.9)	176 (15.6)
Tumor Site	Cardia	168 (4.5)	113 (4.3)	55 (4.9)
	Middle	1106 (29.3)	773 (29.3)	333 (29.4)
	Distal	624 (16.5)	440 (16.7)	184 (16.3)
	Overlapping/NOS	1875 (49.7)	1316 (49.8)	559 (49.4)
Tumor Stage	I/II	2512 (66.6)	1753 (66.4)	759 (67.1)
Tumor Stage	III/IV	1261 (33.4)	889 (33.6)	372 (32.9)
Surgery	No	3354 (88.9)	2352 (89.0)	1002 (88.6)
Surgery	Yes	419 (11.1)	290 (11.0)	129 (11.4)
Radiotherapy	No	3125 (82.8)	2203 (83.4)	922 (81.5)
Radiotherapy	Yes	648 (17.2)	439 (16.6)	209 (18.5)
Chemotherapy	No	1004 (26.6)	677 (25.6)	327 (28.9)
Chemotherapy	Yes	2769 (73.4)	1965 (74.4)	804 (71.1)
Time From Diagnosis to Treatment	<1 month	1811 (48.0)	1296 (49.1)	515 (45.5)
	≥1 month	901 (23.9)	633 (24.0)	268 (23.7)
	Unknown	1061 (28.1)	713 (27.0)	348 (30.8)
Marital Status	Single	1614 (42.8)	1114 (42.2)	500 (44.2)
Marital Status	Married	2159 (57.2)	1528 (57.8)	631 (55.8)
County-level median household income	<$80000	1913 (50.7)	1327 (50.2)	586 (51.8)
County-level median household income	≥$80000	1860 (49.3)	1315 (49.8)	545 (48.2)
Rural-Urban	Nonmetropolitan	431 (11.4)	308 (11.7)	123 (10.9)
Rural-Urban	Metropolitan	3342 (88.6)	2334 (88.3)	1008 (89.1)
Events (Death=1)	n (%)	2451 (65.0)	1675 (63.4)	776 (68.6)

Variable screening

Feature selection was performed using four independent strategies, namely LASSO, Boruta, backward stepwise regression, and BSR. The LASSO approach, under the λ1se penalty, identified six key predictors: age, stage, RT, CT, marital status, and income (Figure 1(a) and (b)). The Boruta algorithm, through iterative permutation and ranking, converged on five stable variables, including age, stage, CT, time from diagnosis to treatment, and marital status (Figure 1(c)). Backward stepwise regression yielded a broader set of seven predictors, namely age, race, stage, CT, time from diagnosis to treatment, marital status, and income (Table 2). In contrast, BSR, guided by the adjusted R² criterion, retained a parsimonious panel of five predictors: age, stage, CT, marital status, and income (Figure 1(d)). Collectively, these results revealed both consistency and heterogeneity across different methods. Age, stage, CT, and marital status were repeatedly selected, highlighting their robust prognostic value, while variables such as race, RT, income, and time from diagnosis to treatment appeared in a method-specific manner, suggesting potential supplementary prognostic information (Table 3).

Figure 1.

Comparison of feature selection methods.

Table 2.

Backward stepwise regression results: univariable, multivariable, and final models.

Characteristic	Level	OR (univariable)	OR (multivariable)	OR (final)
Age	<50 years (reference)
	50-59	1.31 (0.98-1.75, p=0.064)	1.46 (1.08-1.98, p=0.014)	1.47 (1.08-1.98, p=0.013)
	60-69	2.01 (1.54-2.63, p<.001)	2.42 (1.81-3.22, p<.001)	2.41 (1.81-3.21, p<.001)
	70-79	5.11 (3.89-6.73, p<.001)	6.31 (4.7-8.48, p<.001)	6.15 (4.59-8.23, p<.001)
	≥80	16.27 (11.55-22.91, p<.001)	17.53 (12.17-25.23, p<.001)	16.39 (11.43-23.5, p<.001)
Sex	Male (reference)
	Female	1.13 (0.96-1.33, p=0.129)	0.82 (0.68-0.99, p=0.041)
Race	White (reference)
	Black	1.32 (0.98-1.79, p=0.071)	1.97 (1.4-2.77, p<.001)	1.98 (1.41-2.78, p<.001)
	Others	0.74 (0.59-0.93, p=0.011)	0.85 (0.65-1.11, p=0.228)	0.85 (0.65-1.1, p=0.217)
Tumor Site	Cardia (reference)
	Middle	0.85 (0.56-1.29, p=0.440)	0.78 (0.49-1.24, p=0.292)
	Distal	0.67 (0.44-1.04, p=0.073)	0.65 (0.4-1.07, p=0.088)
	Overlapping/NOS	0.97 (0.65-1.46, p=0.890)	0.95 (0.6-1.5, p=0.819)
Tumor Stage	I/II (reference)
	III/IV	1.51 (1.27-1.8, p<.001)	1.68 (1.37-2.05, p<.001)	1.77 (1.46-2.15, p<.001)
Surgery	No (reference)
	Yes	1.15 (0.89-1.48, p=0.293)	1.3 (0.95-1.76, p=0.098)
Radiotherapy	No (reference)
	Yes	0.62 (0.51-0.77, p<.001)	0.89 (0.7-1.13, p=0.337)
Chemotherapy	No (reference)
	Yes	0.28 (0.22-0.34, p<.001)	0.53 (0.39-0.71, p<.001)	0.5 (0.38-0.67, p<.001)
Time From Diagnosis to Treatment	<1 month (reference)
	≥1 month	0.71 (0.59-0.86, p<.001)	0.72 (0.58-0.89, p=0.003)	0.69 (0.56-0.86, p<.001)
	Unknown	2.29 (1.86-2.82, p<.001)	1.32 (0.99-1.75, p=0.060)	1.26 (0.96-1.65, p=0.100)
Marital Status	Single (reference)
	Married	0.61 (0.52-0.72, p<.001)	0.69 (0.57-0.84, p<.001)	0.71 (0.59-0.86, p<.001)
County-level median household income	<$80000 (reference)
	≥$80000	0.74 (0.63-0.87, p<.001)	0.8 (0.66-0.97, p=0.026)	0.77 (0.64-0.92, p=0.004)
Rural-Urban	Nonmetropolitan (reference)
	Metropolitan	0.75 (0.58-0.97, p=0.026)	0.93 (0.68-1.27, p=0.659)

Values are odds ratios (95% confidence interval, p value). Blank cells in the final column indicate variables not retained in the final model.

Table 3.

Results of feature selection.

Method	Features_selected	Selected_features
LASSO	6	Age, Stage, RT, CT, Marital status, Income
Boruta	5	Age, Stage, CT, Time from diagnosis to treatment, Marital status
Backward	7	Age, Race, Stage, CT, Time from diagnosis to treatment, Marital status, Income
BestSubset	5	Age, Stage, CT, Marital status, Income

RT: radiotherapy; CT: chemotherapy.
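
The overlap summarized in Table 3 can be tallied directly. The short sketch below counts how many of the four methods selected each feature, using the selections exactly as listed in the table; the variable names are illustrative.

```python
from collections import Counter

# Feature sets as reported in Table 3.
selected = {
    "LASSO": ["Age", "Stage", "RT", "CT", "Marital status", "Income"],
    "Boruta": ["Age", "Stage", "CT", "Time from diagnosis to treatment", "Marital status"],
    "Backward": ["Age", "Race", "Stage", "CT", "Time from diagnosis to treatment",
                 "Marital status", "Income"],
    "BestSubset": ["Age", "Stage", "CT", "Marital status", "Income"],
}

# Number of methods that selected each feature.
votes = Counter(feature for features in selected.values() for feature in features)

# Features chosen by every method.
core = sorted(f for f, n in votes.items() if n == len(selected))

print(core)           # ['Age', 'CT', 'Marital status', 'Stage']
print(votes["Income"])  # 3
```

Income was selected by three of the four methods, whereas race and RT were each selected by one and time from diagnosis to treatment by two.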

Model development and validation

Four ML algorithms (LR, SVM, KNN, and XGBoost) were implemented in combination with four distinct feature selection strategies (LASSO, Boruta, backward stepwise regression, and BSR) to construct prognostic models. XGBoost achieved the highest AUC within every feature set (testing AUC 0.779-0.784). The XGBoost model built on the five BSR-selected predictors paired this discrimination (testing AUC 0.779) with the highest testing balanced accuracy (0.701) and was therefore chosen as the final predictive framework. A detailed comparison of algorithmic performance across different variable selection methods is presented in Table 4.

Table 4.

Performance of LR, SVM, XGBoost, and KNN models with different feature selection methods.

Model	AUC (training)	AUC (testing)	Sensitivity (training)	Sensitivity (testing)	Specificity (training)	Specificity (testing)	Balanced accuracy (training)	Balanced accuracy (testing)	F1 score (training)	F1 score (testing)
LASSO
LR	0.775	0.775	0.807	0.802	0.559	0.569	0.683	0.685	0.783	0.802
SVM	0.757	0.742	0.887	0.89	0.416	0.431	0.651	0.661	0.798	0.828
KNN	0.693	0.682	0.839	0.807	0.547	0.558	0.693	0.682	0.799	0.803
XGBoost	0.787	0.781	0.829	0.825	0.542	0.566	0.686	0.695	0.792	0.815
Boruta
LR	0.774	0.778	0.785	0.789	0.576	0.58	0.681	0.684	0.774	0.796
SVM	0.747	0.739	0.888	0.899	0.403	0.4	0.646	0.65	0.795	0.828
KNN	0.698	0.681	0.79	0.78	0.606	0.583	0.698	0.681	0.783	0.791
XGBoost	0.786	0.784	0.817	0.814	0.549	0.541	0.683	0.678	0.787	0.805
Backward
LR	0.781	0.78	0.807	0.798	0.572	0.566	0.69	0.682	0.786	0.799
SVM	0.781	0.753	0.854	0.851	0.485	0.487	0.669	0.669	0.794	0.816
KNN	0.709	0.659	0.801	0.769	0.616	0.549	0.709	0.659	0.792	0.779
XGBoost	0.798	0.783	0.821	0.812	0.566	0.569	0.693	0.69	0.793	0.808
BestSubset
LR	0.775	0.774	0.811	0.807	0.557	0.569	0.684	0.688	0.785	0.805
SVM	0.734	0.727	0.901	0.907	0.377	0.377	0.639	0.642	0.797	0.828
KNN	0.685	0.678	0.839	0.807	0.531	0.549	0.685	0.678	0.795	0.802
XGBoost	0.784	0.779	0.831	0.832	0.534	0.569	0.682	0.701	0.791	0.82

LR: logistic regression; SVM: support vector machines; XGBoost: extreme gradient boosting; KNN: k-nearest neighbors.
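
The classification metrics in Table 4 follow from a confusion matrix, as the sketch below shows. The confusion-matrix counts are hypothetical; the final two lines only check that the balanced accuracy reported for the BSR-based XGBoost model equals the mean of its reported sensitivity and specificity, up to rounding.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, balanced accuracy, and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=40, tn=60, fn=20)
print(metrics)

# Consistency check against Table 4 (XGBoost with BestSubset features).
train_ba = (0.831 + 0.534) / 2  # reported as 0.682
test_ba = (0.832 + 0.569) / 2   # reported as 0.701
print(round(train_ba, 4), round(test_ba, 4))
```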

The predictive performance of the final model was comprehensively assessed across both the training and testing cohorts. In 5-fold cross-validation, the mean AUCs were 0.770 ± 0.017 in the training cohort (Figure 2(a)) and 0.772 ± 0.041 in the testing cohort (Figure 2(b)), demonstrating stable discriminative ability. Calibration analysis showed good agreement between predicted and observed probabilities in both cohorts, with C-indices of 0.784 and 0.779, respectively (Figure 2(c) and (d)). The calibration curves closely followed the ideal 45° line, indicating satisfactory model reliability. Furthermore, DCA revealed that the model provided a higher net clinical benefit across a wide range of threshold probabilities compared with default strategies (treat-all or treat-none) in both the training and testing cohorts (Figure 2(e) and (f)). Together, these findings confirmed that the proposed model achieved robust discrimination, good calibration, and favorable clinical utility.
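
The AUC values above can be read as the probability that a randomly chosen patient who died received a higher predicted risk than a randomly chosen patient who survived. A minimal sketch of that pairwise definition follows; the labels and scores are hypothetical, and ties count as half a win.

```python
def auc(y_true, y_score):
    """AUC as the fraction of (event, non-event) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = death, 0 = alive.
labels = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.6, 0.4, 0.2, 0.7, 0.55, 0.3]

toy_auc = auc(labels, scores)
print(round(toy_auc, 3))  # 0.867
```

In five-fold cross-validation this quantity is computed once per held-out fold and then summarized as a mean and standard deviation.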

Figure 2.

Performance evaluation and clinical utility of the predictive model.

Model interpretation and prognostic evaluation

To enhance model interpretability, SHAP analysis was applied to quantify the contribution of each predictor. As shown in Figure 3(a), age had the highest impact on model output (SHAP value = 0.1655), followed by chemotherapy (0.0588), tumor stage (0.0487), marital status (0.0351), and income (0.0284), indicating their relative importance in prognostic prediction. The SHAP partial dependence profiles illustrated both the direction and magnitude of each feature’s effect, revealing nonlinear patterns and potential interactions (Figure 3(b)). Furthermore, individualized SHAP analysis illustrated how specific variables influenced prediction outcomes at the patient level (Figure 3(c)). For the representative patient shown, younger age lowered the predicted risk, whereas the absence of chemotherapy contributed most to raising it; tumor stage, marital status, and income also increased the predicted risk. For this patient, age was therefore protective and the remaining factors were risk-enhancing, which provides an interpretable explanation of the model’s prediction.
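
To put the reported importance values on a common scale, the sketch below expresses each one as a share of their sum. This assumes the reported values are global importance scores (such as mean absolute SHAP values) that can meaningfully be summed; the normalization is an illustration and is not part of the reported analysis.

```python
# Importance values as reported for Figure 3(a).
shap_importance = {
    "Age": 0.1655,
    "Chemotherapy": 0.0588,
    "Tumor stage": 0.0487,
    "Marital status": 0.0351,
    "Income": 0.0284,
}

total = sum(shap_importance.values())
shares = {name: value / total for name, value in shap_importance.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under this assumption, age accounts for roughly half of the summed importance, with the remainder spread across treatment, stage, and socioeconomic factors.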

Figure 3.

SHAP Analysis.

The final model was applied to predict outcomes in the training cohort, and patients were stratified into high- and low-risk groups using the optimal cutoff value. Kaplan–Meier survival analysis demonstrated that the two risk groups exhibited significantly different prognoses in both the training and testing cohorts (p < 0.001, Figure 4(a) and (b)), confirming the model’s strong discriminative ability. These results indicated that the model can effectively distinguish patients with different risk profiles, supporting its potential utility for individualized prognostic assessment and clinical decision-making. A web-based prediction tool was further developed to facilitate real-time, user-friendly access to the model’s outputs (https://seerr.shinyapps.io/appDLBCL/, Figure 5). This platform allows clinicians and researchers to input individual patient characteristics and obtain personalized risk predictions, thereby translating complex ML results into actionable clinical insights.

Figure 4.

Kaplan–Meier survival analysis for risk stratification. Kaplan–Meier survival curves for patients stratified by the optimal risk cutoff in the training (a) and testing (b) cohorts. Patients classified as high-risk show significantly poorer survival compared with low-risk patients in both cohorts, demonstrating that the model effectively discriminates patient prognosis and provides robust risk stratification.

Figure 5.

Development of the web-based predictive tool. Schematic overview of the web tool interface, illustrating the input of patient parameters and real-time calculation of individualized risk scores.

Discussion

In this study, we developed and validated an ML-based prognostic model specifically tailored for patients with PG-DLBCL, using a large population-based cohort from the SEER database. By integrating multiple complementary feature selection strategies with diverse ML algorithms, we identified a robust and stable set of prognostic variables and demonstrated that an ensemble-based XGBoost model outperformed conventional statistical approaches. To our knowledge, this represents one of the first systematic attempts to establish a comprehensive prognostic framework for PG-DLBCL through the combined application of multi-strategy variable selection and multi-model ML.

Unlike prior studies that focused primarily on the IPI or Ann Arbor staging, we systematically assessed a broad spectrum of demographic, clinical, and treatment-related variables in a real-world setting. This approach enabled us to capture heterogeneity often overlooked by traditional indices, such as socioeconomic and marital status, both of which consistently emerged as independent predictors of survival. Their inclusion underscores the multifactorial nature of PG-DLBCL outcomes and extends the scope of prognostic assessment. To minimize model instability caused by collinearity and noise, we employed a multi-angle feature selection pipeline combining LASSO, Boruta, backward stepwise elimination, and BSR. Core prognostic factors—age, stage, chemotherapy, and marital status—were consistently identified across all methods, while features such as income and radiotherapy appeared more context-dependent, potentially reflecting nuanced modifiers of outcome. This layered approach enhanced both reproducibility and interpretability, distinguishing our framework from single-method models commonly reported in oncology research.

Among the identified predictors, age was reaffirmed as a dominant factor, consistent with its well-established role in DLBCL prognosis.^2,4,18 Older patients often experience poorer outcomes due to comorbidities, reduced treatment tolerance, and less frequent use of intensive immunochemotherapy. Similarly, chemotherapy, particularly rituximab-based regimens, remains a cornerstone of therapy and was strongly associated with improved survival in our analysis.^1,4,5,18–20 Beyond biological and treatment-related factors, social and economic variables also shaped outcomes: married patients had superior survival, likely reflecting greater psychosocial support, while higher household income correlated with improved prognosis, potentially due to enhanced access to care.^21–24 These findings highlighted the importance of integrating non-biological factors into prognostic models and addressing disparities in PG-DLBCL management. Ultimately, the XGBoost model built on BSR-selected variables yielded the best predictive performance and was adopted as the final framework.

Beyond predictive accuracy, our model provided transparency through SHAP analysis. One of the major critiques of ML in clinical oncology is its “black-box” nature.²⁵ By quantifying the directionality and magnitude of each predictor’s contribution at both cohort and individual levels, our framework translated complex computational outputs into clinically meaningful explanations. This interpretability is essential for clinician trust and practical adoption, enabling oncologists to understand the rationale for risk classification and to incorporate these insights into treatment planning.^25,26

Clinically, the implications of this model are considerable. The clear separation of survival curves between high- and low-risk groups highlights its utility for refining risk stratification, guiding treatment intensity, and identifying patients who may benefit from closer monitoring or clinical trial enrollment. Moreover, the development of a web-based tool ensured real-time accessibility, bridging the gap between computational modeling and bedside decision-making. Such accessibility enhances the likelihood of clinical integration and may ultimately promote more equitable care for PG-DLBCL patients.

Furthermore, our model’s performance should be viewed in the context of established clinical tools like the NCCN-IPI.²⁷ As demonstrated by Zhou et al.,²⁷ the NCCN-IPI significantly improved the identification of high-risk patients by refining age and LDH categorizations. Our findings aligned with their observation that clinical outcomes in DLBCL are driven by a complex interplay of patient and disease characteristics. However, while the NCCN-IPI is designed for rapid bedside risk grouping, our ML-based approach aimed to provide a more tailored prognostic score. By integrating a wider array of variables available in the SEER database and utilizing XGBoost’s ability to handle non-linear interactions, we offer a supplementary tool that may capture nuances—such as socioeconomic factors or specific treatment combinations—that traditional scoring systems might not fully account for. Consequently, as highlighted by Ismayilov et al., ML represents a powerful evolution in oncology, offering a supplementary tool to address the complex clinical challenges that traditional linear models are less equipped to resolve.¹⁴

Nevertheless, several limitations merit attention. First, while SEER provides a large-scale and representative dataset, key biological and molecular parameters are not available. Incorporating such molecular signatures would likely enhance prognostic precision and align the model with the evolving biological understanding of DLBCL heterogeneity.^28,29 Second, the SEER database records chemotherapy as a binary “Yes/No” variable and lacks details on regimen, dosage, and number of cycles, which are critical for survival analysis. Third, from a gastroenterological perspective, the pathogenesis of gastric lymphoma is often linked to chronic Helicobacter pylori (H. pylori) infection, which can trigger the development of mucosa-associated lymphoid tissue (MALT) lymphoma and its subsequent transformation into more aggressive forms, such as DLBCL.³⁰ While our model focuses on survival outcomes based on characteristics at the time of diagnosis, the potential for MALT transformation remains a key clinical consideration. Future studies with access to primary endoscopic and microbiological data are needed to clarify how H. pylori eradication and prior MALT history might influence the performance of ML-based prognostic tools. Finally, although internal validation demonstrated robust calibration and discrimination, external validation using independent, multi-institutional cohorts will be necessary to confirm generalizability across diverse patient populations and clinical settings. Despite these limitations, our work establishes a methodological paradigm for rare extranodal lymphomas: the integration of multi-strategy variable selection with ML not only improved predictive performance but also yielded interpretable, clinically actionable tools. Moving forward, such integrative approaches may form the foundation for dynamic prognostic systems that incorporate molecular data, treatment response, and longitudinal follow-up, thereby advancing the promise of precision oncology in PG-DLBCL.

Conclusion

In summary, this study presents a comprehensive, ML–driven prognostic framework specifically for PG-DLBCL. By integrating multi-strategy feature selection with diverse ML algorithms, we identified robust predictors—age, stage, chemotherapy, marital status, and income—that drive survival outcomes. The XGBoost model demonstrated strong discrimination, reliable calibration, and clear clinical utility, while SHAP analysis provided interpretable insights at both cohort and individual levels. Risk stratification effectively separated high- and low-risk patients, highlighting its potential to guide individualized treatment, surveillance, and clinical trial enrollment. This approach offers a novel paradigm for prognostic modeling in rare extranodal lymphomas and lays the foundation for precision oncology in PG-DLBCL.

Supplemental material

Supplemental material for Multi-strategy feature selection and multi-model machine learning for prognostic prediction in primary gastric diffuse large B-cell lymphoma by Jingjie Lin, Hanlei Wang, Huirong Lin, Chaowei Xu in DIGITAL HEALTH

Footnotes

Acknowledgments

The authors utilized a large language model (LLM) solely for grammatical refinement and linguistic polishing to improve the readability of the manuscript. All scientific content, data analysis, and interpretations were performed by the authors, and the final manuscript was thoroughly reviewed and approved by all contributors.

ORCID iD

Chaowei Xu

Author contributions

Jingjie Lin: Conceptualization, Data curation, Methodology, Formal analysis, Writing—original draft preparation. Hanlei Wang: Data curation, Software, Validation, Writing—review & editing. Huirong Lin: Investigation, Writing—review & editing. Chaowei Xu: Conceptualization, Supervision, Methodology, Writing—review & editing, and Corresponding author responsibilities.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets used during the current study are available from the corresponding author upon reasonable request.

