
Abstract

Objective

Cardiovascular disease constitutes the primary cause of mortality in long-term breast cancer (BC) survivors, yet predictive tools for cardiovascular-specific survival (CSS) in those with a second primary cancer (SPC) remain limited. This study aims to develop a machine learning (ML) model predicting CSS in BC patients with SPC (BC-SPC).

Methods

Patients with BC-SPC diagnosed between 2010 and 2021 were identified from the Surveillance, Epidemiology, and End Results (SEER) database. After variable screening with least absolute shrinkage and selection operator (LASSO) regression, five predictive models were constructed: extreme gradient boosting (XGBoost), Cox proportional hazards model, random survival forest (RSF), DeepSurv, and support vector machine (SVM). Model performance was assessed using the concordance index (C-index), area under the receiver operating characteristic curve (AUC), calibration curves, and decision curve analysis (DCA). SHapley Additive exPlanations (SHAP) analysis and visualization were performed for the optimal model.

Results

A total of 22,814 BC-SPC patients were included. Among these, 565 cardiovascular disease-specific deaths occurred, with cumulative incidence rates of 1.29%, 3.06%, and 4.30% at 5, 8, and 10 years, respectively. RSF demonstrated optimal performance, with a C-index of 0.749 in the training set and 0.752 in the validation set. Time-dependent AUCs at 5, 8, and 10 years were 0.774, 0.761, and 0.766 for the training set, and 0.752, 0.769, and 0.760 for the validation set, respectively. DCA indicated favorable net benefit across relevant thresholds. SHAP analysis revealed that age, radiation, marital status, chemotherapy, surgery, race, and sex were the key drivers, in descending order of importance. Based on RSF risk scores, significant differences in CSS were observed among the risk groups (log-rank p < 0.001). A Shiny-based web tool was developed for personalized prediction.

Conclusion

The RSF model with SHAP interpretation offers an accurate, user-friendly tool for individualized CSS prediction in BC-SPC and supports precision risk management.

Keywords

Breast cancer, cardiovascular-specific survival, machine learning, second primary cancer, SHapley Additive exPlanations

Introduction

Breast cancer (BC) is one of the most common malignancies among women worldwide. Advances in diagnosis and treatment have significantly extended patient survival. However, this progress has also made long-term complications and the risk of second primary cancers (SPCs) critical determinants of prognosis.^1,2 Cohort studies have shown that the incidence of SPCs among BC survivors is markedly higher than in the general population, with pathogenesis involving a combination of genetic susceptibility, treatment-related toxicity, and environmental exposures.³ For instance, breast cancer survivors carrying BRCA1 and BRCA2 pathogenic variants are at high risk of SPCs⁴; radiotherapy and chemotherapy, by inducing DNA damage or immunosuppression, may increase the incidence of SPCs such as lung cancer and hematologic malignancies.⁵ At the same time, comprehensive breast cancer treatments (including anthracycline-based chemotherapy, radiotherapy, and targeted therapies), while improving survival rates, can also induce cardiotoxicity, leading to myocardial injury and metabolic disturbances, thereby raising the risk of cardiovascular disease (CVD) mortality, which has become the leading cause of non-cancer mortality among BC survivors.⁶ Therefore, proactive prevention and management of cardiovascular complications have become essential components of clinical care.

Dynamic interactions between clinical baseline characteristics and treatment-induced toxicities constitute a multifaceted risk network. This complexity not only influences the occurrence of SPCs but also exacerbates the development and adverse outcomes of CVD. Consequently, SPC risk and cardiovascular complications in breast cancer patients become intertwined, greatly increasing the complexity of clinical decision-making and prognostic assessment. Their combined effects pose a dual threat to long-term patient health.^7,8 However, existing research has predominantly focused on SPC or CVD risk in isolation and lacks systematic investigation of cardiovascular mortality risk in BC patients with SPC (BC-SPC). In particular, no dynamic prediction model integrating clinical characteristics and treatment parameters has yet been established.^9,10

Traditional statistical methods exhibit significant limitations in handling high-dimensional data, nonlinear relationships, and complex variable interactions, making it difficult to comprehensively dissect the multifaceted risk factors influencing cardiovascular outcomes. In recent years, machine learning (ML) has been widely adopted in cancer risk prediction and prognostic assessment due to its superior data processing and modeling capabilities.¹¹ Predictive models built on algorithms such as random survival forest (RSF), extreme gradient boosting (XGBoost), and support vector machine (SVM) have been successfully applied to BC recurrence risk evaluation, multi-omics data integration, and neoadjuvant chemotherapy efficacy prediction. Compared with traditional approaches, these ML algorithms demonstrate unique advantages in feature selection, precise dynamic risk stratification, and model generalizability.^12,13

Therefore, this study utilized a retrospective cohort from the U.S. Surveillance, Epidemiology, and End Results (SEER) database to systematically develop and compare the predictive performance of a Cox regression model and four survival ML models (RSF, SVM, XGBoost, and DeepSurv) for cardiovascular-specific survival (CSS) in BC-SPC patients. Further, SHapley Additive exPlanations (SHAP) interpretability analysis was adopted to quantify the contribution of risk factors in the optimal model, and an interactive web-based prediction tool was developed to support clinical translation for personalized risk assessment and dynamic health management.

Patients and methods

Data source

Data were accessed and extracted from the SEER database using SEER*Stat (version 8.4.3), in accordance with the SEER Data Access Policy and the SEER Research Data Use Agreement. Currently, this database covers approximately 48% of the U.S. population and aggregates information from 21 population-based cancer registries nationwide.¹⁴ We extracted records of breast cancer patients with second primary cancers diagnosed between 2010 and 2021. As SEER is a publicly available anonymized database and this study strictly adhered to its published research guidelines, no additional ethical approval was required, and the requirement for informed consent was waived.

Study variables

Study variables included: (1) demographic characteristics (sex, marital status, race, age); (2) tumor characteristics (grade, laterality, histology, TNM stage according to the AJCC 7th edition); (3) hormone receptor status (human epidermal growth factor receptor 2 [HER2], progesterone receptor [PR], estrogen receptor [ER]); (4) treatment modalities (surgery, radiotherapy, chemotherapy); (5) outcome measures (cardiovascular-specific death status, survival time). Treatment information was extracted from SEER “first course of therapy” fields. SPC was defined in strict accordance with the SEER Multiple Primary and Histology Coding Rules. Patients were included only if they had a first primary malignancy of BC followed by a distinct second primary malignancy (Sequence Number=02). The index date was the diagnosis of the SPC. Follow-up for all patients began at this time point and continued until CVD-specific death, censoring, or the end of the study period. Figure 1 shows the study design timeline.

Figure 1.

Study design timeline. BC, breast cancer; SPC, second primary cancer.

Inclusion criteria were: (1) pathologically confirmed primary invasive breast cancer diagnosed during 2010–2021; (2) age ≥18 years at diagnosis; (3) survival time ≥12 months post-diagnosis. Exclusion criteria comprised: (1) ≥3 primary malignancies; (2) single primary breast cancer; (3) non-histologically confirmed diagnosis; (4) Stage IV; (5) unevaluable histologic grade; (6) undergoing prophylactic contralateral mastectomy; (7) missing key variables.

Statistical methods

Statistical analyses, modeling, and evaluations were performed using R software (version 4.4.2). Patients with BC-SPC were randomly stratified into training and validation sets in a 7:3 ratio. Categorical variables were compared using the χ² test or Fisher's exact test, with a two-sided p < 0.05 considered statistically significant.
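The 7:3 partition can be illustrated with a short sketch. The analysis itself was run in R and the randomization procedure is not described in detail, so the function name, the seed, and the use of a plain (unstratified) shuffle below are all assumptions made for illustration only.

```python
import random

def split_indices(n, train_fraction=0.7, seed=2024):
    """Randomly partition range(n) into training and validation index lists.

    The seed is an arbitrary illustrative value, not the one used in the study.
    """
    indices = list(range(n))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(indices)
    n_train = round(n * train_fraction)
    return indices[:n_train], indices[n_train:]

# 22,814 is the cohort size reported in the Results.
train_idx, valid_idx = split_indices(22814)
print(len(train_idx), len(valid_idx))
```

With a cohort of 22,814 patients, rounding 70% of the cohort yields the same set sizes as those reported for the training and validation cohorts.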

Based on the training cohort, predictors were selected via LASSO regression, and a prognostic nomogram was constructed using multivariable Cox proportional hazards modeling. The remaining four models (RSF, SVM, XGBoost, and DeepSurv) underwent hyperparameter tuning in the training set via grid search combined with 10-fold cross-validation to identify the optimal parameters.

The performances of the five models were evaluated in both the training and validation sets. The following metrics were used: the concordance index (C-index) and time-dependent areas under the receiver operating characteristic curves (AUCs) at 5-, 8-, and 10-year intervals to assess each model's discriminative ability. Calibration curves were used to evaluate the agreement between predicted and observed outcomes. Decision curve analysis (DCA) was performed to calculate the clinical net benefit of each model. The workflow for model development and validation is shown in Figure 2.

Figure 2.

Patient screening and study design flowchart. SPBC, single primary breast cancer; RSF, random survival forest; SVM, support vector machine; XGBoost, extreme gradient boosting; ROC, receiver operating characteristic; DCA, decision curve analysis; SHAP, SHapley Additive exPlanations.

After identifying the optimal model by comparing the various metrics, individualized risk scores for each patient were calculated. Patients were stratified according to their risk scores. Survival analysis was performed using Kaplan–Meier curves, and differences between groups were compared with the log-rank test.
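For readers less familiar with the Kaplan–Meier estimator, the following is a minimal sketch of how the survival curve for one risk group is built from follow-up times and event indicators. The follow-up times are hypothetical and the function is illustrative; it is not the R routine used for the analysis.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  : follow-up durations
    events : 1 if the event (here, CVD-specific death) occurred, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed  # events and censored subjects leave the risk set
    return curve

# Hypothetical follow-up in months for six patients.
km_curve = kaplan_meier([12, 20, 20, 35, 48, 60], [1, 1, 0, 0, 1, 0])
for t, s in km_curve:
    print(t, round(s, 3))
```

Censored patients reduce the number at risk without lowering the curve, which is why the estimate drops only at event times.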

SHAP analysis was employed to interpret the predictive mechanisms of the optimal machine learning model. In the SHAP summary plot, the x-axis represents the magnitude of SHAP values, reflecting both the strength and directionality of variable contributions to model outputs. The color of the points maps to the original feature values, revealing non-linear relationships between feature magnitudes and their corresponding contribution scores.

The individualized prediction consists of a survival probability curve and local SHAP explanation plots, providing prognostic assessment from survival expectancy and risk factor contributions. First, the survival probability for each patient is calculated based on non-parametric methods. Second, local SHAP plots are used to analyze the predictive contributions of key variables to specific individuals, revealing quantitative associations between risk factors and prognosis. Finally, leveraging the optimal machine learning model, a web-based calculator using the Shiny framework is developed to dynamically predict CSS in BC-SPC patients.

Results

Clinical characteristics of BC-SPC patients

This study included 22,814 BC-SPC patients, comprising a training cohort (n = 15,970) and a validation cohort (n = 6844). During follow-up, 565 CVD-specific deaths occurred. The cumulative incidence of CVD death was 1.29%, 3.06%, and 4.30% at 5, 8, and 10 years, respectively. Baseline characteristics demonstrated balanced distributions between the training and validation cohorts across demographic variables (sex, age, race, marital status), tumor features (grade, laterality, histologic type, pTNM stage), treatment modalities (surgery, radiotherapy, chemotherapy), and hormone receptor status (ER, PR, HER2), with all intergroup comparisons showing p > 0.05. In the overall cohort, 98.97% were female, 64.96% were aged ≥ 60 years, invasive ductal carcinoma accounted for 84.29%, TNM stage I was the most common (57.63%), and 97.19% underwent surgery (Table 1).

Table 1.

Characteristics of BC-SPC patients in the training and validation sets.

Variables	Total(n = 22,814)	Training set(n = 15,970)	Validation set(n = 6844)	p
Sex, n (%)				0.853
Male	236 (1.03)	167 (1.05)	69 (1.01)
Female	22,578 (98.97)	15,803 (98.95)	6775 (98.99)
Age, n (%)				0.461
< 60	7993 (35.04)	5620 (35.19)	2373 (34.67)
≥ 60	14,821 (64.96)	10,350 (64.81)	4471 (65.33)
Race, n (%)				0.734
White	18,318 (80.29)	12,813 (80.23)	5505 (80.44)
Black	2410 (10.56)	1703 (10.66)	707 (10.33)
Other	2086 (9.14)	1454 (9.1)	632 (9.23)
Marital, n (%)				0.2
Single	3381 (14.82)	2337 (14.63)	1044 (15.25)
Married	12,716 (55.74)	8961 (56.11)	3755 (54.87)
Other	6717 (29.44)	4672 (29.25)	2045 (29.88)
Grade, n (%)				0.111
I	5987 (26.24)	4225 (26.46)	1762 (25.75)
II	10,559 (46.28)	7319 (45.83)	3240 (47.34)
III	6268 (27.47)	4426 (27.71)	1842 (26.91)
Laterality, n (%)				0.333
Left	11,647 (51.05)	8187 (51.26)	3460 (50.56)
Right	11,167 (48.95)	7783 (48.74)	3384 (49.44)
Histology, n (%)				0.312
Ductal	19,229 (84.29)	13,441 (84.16)	5788 (84.57)
Lobular	2287 (10.02)	1596 (9.99)	691 (10.1)
Other	1298 (5.69)	933 (5.84)	365 (5.33)
TNM, n (%)				0.487
I	13,147 (57.63)	9204 (57.63)	3943 (57.61)
II	7479 (32.78)	5212 (32.64)	2267 (33.12)
III	2188 (9.59)	1554 (9.73)	634 (9.26)
Surgery, n (%)				0.693
No	640 (2.81)	443 (2.77)	197 (2.88)
Yes	22,174 (97.19)	15,527 (97.23)	6647 (97.12)
Radiation, n (%)				0.127
No/Unknown	9578 (41.98)	6652 (41.65)	2926 (42.75)
Yes	13,236 (58.02)	9318 (58.35)	3918 (57.25)
Chemotherapy, n (%)				0.094
No/Unknown	15,069 (66.05)	10,493 (65.7)	4576 (66.86)
Yes	7745 (33.95)	5477 (34.3)	2268 (33.14)
ER status, n (%)				0.149
Negative	3305 (14.49)	2361 (14.78)	944 (13.79)
Positive	19,278 (84.5)	13,447 (84.2)	5831 (85.2)
Borderline/Unknown	231 (1.01)	162 (1.01)	69 (1.01)
PR status, n (%)				0.238
Negative	5483 (24.03)	3886 (24.33)	1597 (23.33)
Positive	17,044 (74.71)	11,880 (74.39)	5164 (75.45)
Borderline/Unknown	287 (1.26)	204 (1.28)	83 (1.21)
HER2 status, n (%)				0.08
Negative	19,329 (84.72)	13,513 (84.61)	5816 (84.98)
Positive	2558 (11.21)	1829 (11.45)	729 (10.65)
Borderline/Unknown	927 (4.06)	628 (3.93)	299 (4.37)

ER: estrogen receptor; PR: progesterone receptor; HER2: human epidermal growth factor receptor 2.
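As an illustration of how the p-values in Table 1 arise, the sketch below applies a Pearson χ² test to the Sex row (training vs validation counts). The Yates continuity correction is an assumption on our part: it is the default for 2×2 tables in R's `chisq.test`, but the text does not state whether it was applied. The function name is hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d, yates=True):
    """Pearson chi-square test (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [a, b, c, d]
    row_totals = [a + b, a + b, c + d, c + d]
    col_totals = [a + c, b + d, a + c, b + d]
    statistic = 0.0
    for obs, row, col in zip(observed, row_totals, col_totals):
        expected = row * col / n
        diff = abs(obs - expected)
        if yates:
            # Continuity correction (assumed, not stated in the paper).
            diff = max(diff - 0.5, 0.0)
        statistic += diff ** 2 / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Sex row of Table 1: male (training, validation), female (training, validation).
statistic, p_value = chi_square_2x2(167, 69, 15803, 6775)
print(round(statistic, 4), round(p_value, 3))
```

Under these assumptions the result should land close to the p = 0.853 reported for Sex, consistent with a well-balanced random split.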

LASSO regression for variable screening

Fourteen variables were initially included in the analysis. Using the training dataset, LASSO regression with 10-fold cross-validation was conducted for variable selection. At the log(λ) value given by the one-standard-error rule (−5.747), seven variables retained non-zero coefficients and formed the final feature set: Sex, Age, Race, Marital status, Surgery, Radiation, and Chemotherapy (Figure 3).

Figure 3.

LASSO regression for variable selection. (A) 10-fold cross-validation analysis; (B) Coefficient path plot. LASSO, least absolute shrinkage and selection operator.
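The one-standard-error rule used to choose λ can be sketched as follows, assuming the usual convention: take the largest penalty whose mean cross-validated error is within one standard error of the minimum. The λ grid and error values below are hypothetical and are not taken from Figure 3; the function name is likewise illustrative.

```python
def lambda_one_se(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    lambdas : candidate penalty values
    cv_mean : mean cross-validated error for each penalty
    cv_se   : standard error of that mean for each penalty
    """
    best = min(range(len(lambdas)), key=lambda k: cv_mean[k])
    threshold = cv_mean[best] + cv_se[best]
    # Largest penalty (sparsest model) whose error stays under the threshold.
    eligible = [lam for lam, err in zip(lambdas, cv_mean) if err <= threshold]
    return lambdas[best], max(eligible)

# Hypothetical cross-validation results for five penalties.
lambdas = [0.0005, 0.001, 0.002, 0.004, 0.008]
cv_mean = [10.52, 10.50, 10.51, 10.53, 10.60]
cv_se = [0.04, 0.04, 0.04, 0.04, 0.04]

lam_min, lam_1se = lambda_one_se(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

Choosing the larger penalty trades a negligible loss in cross-validated fit for a sparser model, which is how a 14-variable candidate set can shrink to seven predictors.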

Prediction model construction

RSF Model: The optimal hyperparameter combination for the model was determined through grid search, with ntree set to 400, mtry to 3, nodesize to 30, and other hyperparameters retaining their default settings.

Cox regression model: Sex, Age, Race, Marital status, Surgery, Radiation, and Chemotherapy were identified as independent prognostic factors for BC-SPC patients in the multivariable Cox regression analysis. The results are presented in Table 2, and the corresponding nomogram is shown in Figure 4.

Figure 4.

Nomogram for Cox regression. CSS, cardiovascular-specific survival.

Table 2.

Multivariable Cox regression analysis of the training set.

Variables	HR	95%CI	p
Sex
Male	Ref.
Female	0.370	0.202–0.679	0.001
Age
<60	Ref.
≥60	4.872	3.480–6.820	<0.001
Race
White	Ref.
Black	1.482	1.107–1.985	0.008
Other	0.869	0.586–1.289	0.485
Marital
Single	Ref.
Married	0.679	0.498–0.925	0.014
Other	1.220	0.899–1.656	0.202
Surgery
No	Ref.
Yes	0.471	0.290–0.764	0.002
Radiation
No/Unknown	Ref.
Yes	0.551	0.450–0.675	<0.001
Chemotherapy
No/Unknown	Ref.
Yes	0.608	0.471–0.785	<0.001

Ref: Reference; CI: confidence interval; HR: hazard ratio.
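The relationship between the hazard ratios, confidence intervals, and p-values in Table 2 can be checked with a short sketch. It assumes Wald-type intervals that are symmetric on the log scale, which is the usual output of Cox software but is not stated explicitly in the text; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def wald_from_ci(hr, lower, upper, level=0.95):
    """Recover the log-HR standard error, z statistic, and two-sided p-value
    from a hazard ratio and its confidence interval (Wald interval assumed)."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(hr) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return se, z, p_value

# Three rows copied from Table 2: (HR, lower bound, upper bound).
rows = {
    "Female vs male": (0.370, 0.202, 0.679),
    "Black vs White": (1.482, 1.107, 1.985),
    "Married vs single": (0.679, 0.498, 0.925),
}
recovered = {}
for label, (hr, lower, upper) in rows.items():
    se, z, p = wald_from_ci(hr, lower, upper)
    recovered[label] = p
    print(f"{label}: SE={se:.3f}, z={z:.2f}, p={p:.3f}")
```

Because the inputs are rounded to three decimals, the recovered p-values agree with the tabulated ones only approximately.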

XGBoost Model: The optimal hyperparameter combination was determined through grid search, with max_depth set to 4, eta to 0.1, lambda to 0.5, and other hyperparameters retaining their default settings.

SVM Model: The optimal hyperparameter combination was determined through grid search, with kernel set to “RBF kernel”, gamma.mu to 0.12, and other hyperparameters retaining their default settings.

DeepSurv Model: The optimal hyperparameter combination was determined through grid search, with alpha set to 0.1, learning rate to 0.008, and other hyperparameters retaining their default settings.

Model evaluation and interpretation

The C-index was compared across training and validation sets to evaluate the discriminative ability of each model. The C-indices for the RSF, Cox, XGBoost, DeepSurv, and SVM models (hereafter “the five models”) were 0.749, 0.733, 0.740, 0.725, and 0.692 in the training set, and 0.752, 0.752, 0.750, 0.747, and 0.701 in the validation set, respectively. Based on C-index comparisons, the RSF, Cox, and XGBoost models demonstrated comparable discriminative performance overall, while the DeepSurv model exhibited slightly lower performance, and the SVM model showed the poorest discriminative ability.
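To make the metric concrete, the following is a minimal sketch of Harrell's C-index for right-censored data. It is an illustrative O(n²) implementation on hypothetical values, not the R routine used to produce the figures above, and it simply skips pairs with tied follow-up times.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs in which the subject with the
    shorter observed event time also has the higher predicted risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical follow-up times (years), event flags, and predicted risk scores.
times = [2, 4, 6, 8, 10]
events = [1, 0, 1, 1, 0]
risks = [0.9, 0.3, 0.6, 0.7, 0.1]
c_index = concordance_index(times, events, risks)
print(round(c_index, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported values around 0.75 indicate moderate discrimination.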

To further evaluate the discriminative ability of the five models at three time points (5, 8, and 10 years), time-dependent ROC curves were employed. Analysis of the training set revealed: at the 5-year mark, the RSF model demonstrated optimal predictive performance, followed closely by XGBoost; at 8 years, the ROC curve of the RSF model was marginally lower than that of XGBoost; by year 10, XGBoost maintained its advantage, with the RSF and Cox models ranking next. In the validation set, the RSF model consistently achieved the best predictive performance across all three time points, followed by the XGBoost model. Overall, these results indicate that the RSF model exhibited optimal discriminative performance (Figure 5). Calibration curves were employed to assess the agreement between the predicted and actual outcomes of the five models at the 5-year, 8-year, and 10-year time points. The results showed that, in both the training and validation cohorts, all models exhibited comparable calibration performance (Figure 6).

Figure 5.

ROC analysis of five machine learning models. (A1) Training set – 5 year; (A2) Training set – 8 year; (A3) Training set – 10 year; (B1) Validation set – 5 year; (B2) Validation set – 8 year; (B3) Validation set – 10 year. ROC, receiver operating characteristic; AUC, area under the curve; RSF, random survival forest; SVM, support vector machine; XGBoost, extreme gradient boosting.

Figure 6.

Calibration curves of five machine learning models. (A1) Training set – 5 year; (A2) Training set – 8 year; (A3) Training set – 10 year; (B1) Validation set – 5 year; (B2) Validation set – 8 year; (B3) Validation set – 10 year. RSF, random survival forest; SVM, support vector machine; XGBoost, extreme gradient boosting.

DCA of the five models in the training set revealed that the RSF model consistently demonstrated the highest clinical net benefit at the 5-year, 8-year, and 10-year time points. In the validation set, the RSF model achieved optimal clinical net benefit specifically at the 10-year time point, whereas its performance at the 5-year and 8-year time points was comparable to that of the other models (Figure 7).

Figure 7.

DCA curves for five machine learning models. (A1) Training set – 5 year; (A2) Training set – 8 year; (A3) Training set – 10 year; (B1) Validation set – 5 year; (B2) Validation set – 8 year; (B3) Validation set – 10 year. DCA, decision curve analysis; RSF, random survival forest; SVM, support vector machine; XGBoost, extreme gradient boosting.
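The net benefit plotted in a decision curve can be sketched as below. This is a simplified version for a binary outcome at a fixed horizon that ignores censoring, so it is an assumption-laden illustration rather than the time-to-event DCA used in the study; the predicted risks and outcomes are hypothetical.

```python
def net_benefit(predicted_risk, outcome, threshold):
    """Net benefit of intervening on patients whose predicted risk is at or
    above the threshold probability (binary outcome, no censoring)."""
    n = len(outcome)
    true_pos = sum(1 for p, y in zip(predicted_risk, outcome)
                   if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in zip(predicted_risk, outcome)
                    if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)

def net_benefit_treat_all(outcome, threshold):
    """Reference strategy: intervene on every patient."""
    prevalence = sum(outcome) / len(outcome)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

# Hypothetical predicted risks and observed outcomes for ten patients.
risk = [0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.40, 0.60]
died = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

nb_model = net_benefit(risk, died, threshold=0.10)
nb_all = net_benefit_treat_all(died, threshold=0.10)
print(round(nb_model, 4), round(nb_all, 4))
```

A model is useful at a given threshold when its net benefit exceeds both the treat-all value and zero (the treat-none strategy); a decision curve repeats this comparison across a range of thresholds.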

The comprehensive evaluation indicates that the RSF model is the optimal tool for predicting CSS in BC-SPC patients. The SHAP framework was subsequently employed to analyze and visualize the RSF model. The SHAP summary plot revealed that variables in the model were ranked by importance in descending order as follows: Age, Radiation, Marital status, Chemotherapy, Surgery, Race, and Sex (Figure 8).

Figure 8.

SHAP summary plot for the RSF model. SHAP, SHapley Additive exPlanations; RSF, random survival forest.
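The attribution idea behind SHAP can be illustrated with an exact Shapley computation on a toy function. This sketch enumerates every feature ordering and replaces absent features with a single baseline value; practical SHAP implementations approximate this over a background dataset, so the code shows the principle only and is not the procedure applied to the RSF model. The toy model and its inputs are hypothetical.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values for one instance by enumerating feature orderings.

    Features not yet 'switched on' take their baseline value. The cost grows
    factorially, so this is only feasible for a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for feature in order:
            current[feature] = x[feature]
            output = model(current)
            phi[feature] += output - previous_output  # marginal contribution
            previous_output = output
    return [value / len(orderings) for value in phi]

def toy_model(features):
    """Hypothetical risk function with one main effect and one interaction."""
    return 2.0 * features[0] + features[1] * features[2]

phi = shapley_values(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(phi)
```

The contributions sum to the difference between the prediction for the instance and the prediction at the baseline, and the interaction term is shared equally between the two features involved.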

RSF-based risk stratification of BC-SPC patients

Based on the risk scores calculated by the RSF model for patients in the training set, the Additive Forward Search (AddFor) algorithm was employed to generate candidate cutoff points. Using the maximization of the C-index as the criterion, two optimal cutoff values (24.2 and 46.2) were identified. Based on these thresholds, patients in both the training and validation sets were stratified into high-risk (risk score > 46.2), intermediate-risk (24.2 ≤ risk score ≤ 46.2), and low-risk (risk score < 24.2) groups. The Kaplan-Meier survival curves and log-rank tests demonstrated a statistically significant difference in CSS among the three groups (p < 0.0001, Figure 9).

Figure 9.

Risk stratification of the RSF model on the training and validation sets. (A) Training set risk stratification; (B) Validation set risk stratification.
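The three-group stratification follows directly from the two reported cutoffs and can be written as a small helper. The function name and the example scores are hypothetical; the boundary handling mirrors the definitions given above.

```python
from collections import Counter

LOW_CUTOFF = 24.2   # scores below this value are low risk
HIGH_CUTOFF = 46.2  # scores above this value are high risk

def risk_group(score):
    """Assign an RSF risk score to the low, intermediate, or high group."""
    if score < LOW_CUTOFF:
        return "low"
    if score <= HIGH_CUTOFF:
        return "intermediate"
    return "high"

# Hypothetical risk scores for six patients.
scores = [10.5, 24.2, 30.0, 46.2, 46.3, 80.1]
groups = [risk_group(s) for s in scores]
print(Counter(groups))
```

Scores exactly equal to either cutoff fall in the intermediate group, matching the inequalities used to define the strata.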

Individualized prognostic interpretation for BC-SPC patients

Three BC-SPC patients were randomly selected and sequentially numbered for individualized prognostic illustration. As depicted in Figure 10(a), Patients 1 and 2 exhibited more favorable survival probabilities than Patient 3. Local SHAP analysis explained each patient's prognosis from the perspective of variable contributions, where red-highlighted variables represented positive contributions to the predicted CVD-mortality risk, while blue-highlighted variables indicated negative contributions.

Figure 10.

Individualized prognosis prediction for BC-SPC patients. (A) Estimated survival probability curves for three patients; (B) Local SHAP plot for patient 1; (C) Local SHAP plot for patient 2; (D) Local SHAP plot for patient 3. BC-SPC, breast cancer with second primary cancer; SHAP, SHapley Additive exPlanations.

The local SHAP plot for Patient 1 demonstrated that radiation, marital status, and surgery served as features associated with lower CVD-mortality risk, whereas age and chemotherapy were linked to higher risk. Additionally, the contribution of sex and race was minimal in this example (Figure 10(b)). Patient 2 (Figure 10(c)) exhibited a pattern consistent with Patient 1: radiation, marital status, and surgery were linked to lower risk prediction, whereas age and chemotherapy were associated with increased risk, with sex and race exhibiting minimal impact. For Patient 3, marital status, age, and chemotherapy were associated with a higher CVD-mortality risk, whereas radiation and surgery were linked to lower risk. The contributions of sex and race remained minor in this example (Figure 10(d)).

Web-based calculator for the optimal model

To facilitate the clinical translation of the research findings, we developed a web-based RSF model prediction tool using the Shiny framework (https://webcalcula.shinyapps.io/RSF-model_pro/) for the dynamic prognostic assessment of CSS in BC-SPC patients. This tool integrates an interactive interface, real-time computation, and visualization capabilities to dynamically generate individualized survival probability curves, survival rates at specific time points, and individual-level SHAP explanations, thereby providing clinicians with precise and convenient risk-quantification support. Its user-friendly design, highly adapted to clinical settings, significantly enhances the implementation of personalized health management for BC-SPC patients.

Discussion

This study systematically developed and validated a CSS prediction model using a cohort of BC-SPC patients from the SEER database. By comparing Cox regression and four machine learning algorithms, the RSF model demonstrated superior predictive performance. SHAP interpretability analysis identified age, radiotherapy, marital status, chemotherapy, and surgery as dominant predictors for CSS. Ultimately, an online dynamic prediction tool developed using the Shiny framework enables real-time assessment of CSS risk, providing reliable technical support for precision clinical decision-making.

Early identification of cardiovascular mortality risk in BC-SPC patients is critical for improving quality of life and reducing comorbidity management costs. Although existing studies have shown that ML algorithms outperform traditional statistical methods in assessing CVD risk,¹⁵ predictive modeling in the field of cardio-oncology remains exploratory.¹⁶ The RSF model established in this study demonstrated the best overall discriminative ability among the five models in 5-, 8-, and 10-year predictions, as evidenced by its C-index and time-dependent AUC values. Additionally, the calibration curves revealed favorable predictive consistency. Further validation through DCA confirmed that the RSF model provides enhanced clinical net benefit. This may stem from the RSF model's capability to effectively capture nonlinear relationships and higher-order interactions among variables by integrating multiple decision trees.^17,18 Moreover, its tree-based risk stratification mechanism has shown excellent performance in numerous cancer-prognosis studies.^19,20 This study not only confirms the superiority of the RSF model for CSS prediction in BC-SPC patients but also provides important guidance for selecting prognostic models in clinical practice. With its outstanding predictive accuracy and capacity to handle complex clinical data, RSF emerges as the ideal tool for prognostic assessment in this patient population.

Concurrently, the XGBoost model also performed exceptionally well in long-term survival prediction, particularly for 10-year survival, with discriminative power closely approximating that of the RSF model. As a gradient boosting tree algorithm, XGBoost likewise excels at handling high-dimensional data and has demonstrated high accuracy and reliability in various clinical prediction tasks.^21,22 Additionally, the DeepSurv model showed noteworthy predictive performance at certain time points, indicating its potential applicability in specialized scenarios.²³ Future research could explore strategies to integrate the strengths of multiple models, thereby enhancing predictive accuracy and clinical utility.^16,24

Although ML models offer advantages in predictive accuracy, their “black-box” nature still limits widespread clinical adoption. To enhance model transparency and clinical utility, this study implemented SHAP interpretability analysis to elucidate the decision-making processes of the RSF model. By quantifying the contribution of predictors to individual survival risk, SHAP analysis not only identified key factors associated with CSS risk in BC-SPC patients but also clarified the direction of their contribution, substantially enhancing the model's clinical interpretability and practical utility.²⁵ SHAP analysis revealed that age, radiotherapy, marital status, chemotherapy, and surgery are the core predictors of CSS. Among these, age was identified as the primary risk factor, with its effects likely associated with age-related impairment of vascular endothelial function and cumulative toxicity from anticancer therapies.^26,27 Previous studies have also demonstrated that cardiovascular-specific cumulative mortality progressively increases with advancing diagnostic age and extended follow-up duration. Particularly among patients aged ≥75 years, CVD cumulative mortality surpasses that of breast cancer itself at approximately 12 years post-diagnosis, underscoring the imperative for long-term cardiovascular risk management strategies in this older adult population.²⁸

Regarding treatment variables, SHAP analysis revealed that radiotherapy, chemotherapy, and surgery were all associated with a lower risk of CVD mortality. This phenomenon may stem from complex bidirectional interactions between tumors and the cardiovascular system mediated through inflammatory factors, metabolic abnormalities, and neuroendocrine signaling. Research has found that endothelin-1 (ET-1) secreted by breast tumor cells serves not only as a key factor in breast cancer progression and metastasis but also as a critical mediator of myocardial hypertrophy. Eliminating tumor cells reduces circulating ET-1 levels, thereby alleviating its pro-hypertrophic effects on the heart.^29,30 Similarly, resection of primary tumor lesions might mitigate long-term chronic cardiovascular damage caused by tumor-related systemic inflammatory responses. Regarding radiotherapy, modern precision techniques such as intensity-modulated radiotherapy (IMRT) and deep inspiration breath hold (DIBH) have significantly reduced cardiac radiation exposure, potentially minimizing harm while maintaining tumor control.^31–33 In contrast, the contribution of surgery appears relatively modest, potentially due to its threshold effect for benefit. Although surgery can indirectly prevent chemotherapy-induced cardiotoxicity by reducing the required cycles and dosages of adjuvant chemotherapy, the stress response from overly extensive surgical procedures (such as total mastectomy) may partially counteract its protective effects.^34–36 In summary, the association between anticancer treatments and cardiovascular outcomes is complex. These observations may also, to some extent, reflect that patients receiving active treatment possess better physiological reserve and fewer comorbidities. Therefore, personalized risk-benefit assessments should be conducted when formulating systemic anticancer regimens, carefully weighing antitumor efficacy against potential cardiovascular risks. Through multidisciplinary collaboration, comprehensive strategies integrating precision therapy and cardioprotection should be developed to maximize antitumor efficacy while avoiding compound toxicities, ultimately optimizing patient survival and quality of life.^6,8,37

Marital status was also identified as a significant prognostic factor, with being married associated with better outcomes. Studies indicate that the spousal support available to married patients not only helps alleviate psychological stress, thereby suppressing excessive sympathetic activation and elevated inflammatory markers, but also promotes treatment adherence and ensures timely cardiac function monitoring. This finding substantiates the modulatory influence of psychosocial factors on cardiovascular prognosis in cancer survivors, providing a theoretical basis for integrating social support interventions into clinical practice.^38–40

In summary, the SHAP analysis has substantially enhanced the transparency and interpretability of the RSF model by elucidating both the magnitude and direction of feature impacts, successfully transforming this high-accuracy machine learning tool into a clinically intelligible and trustworthy decision aid. Our study not only validates and refines findings from existing literature on key clinical predictors but also provides visual, user-friendly, and easily interpretable risk assessments to support personalized patient management and clinical decision-making. Leveraging the model's interpretable outputs, clinicians can swiftly identify high-risk patients and focus more precisely on key factors when formulating treatment strategies and follow-up plans. For example, for elderly, unmarried patients undergoing chemotherapy, incorporating psychosocial support and enhancing multidisciplinary monitoring for cardiotoxicity may be considered. This will help achieve more precise and efficient clinical management by driving the seamless integration of artificial intelligence technology into clinical practice, ultimately improving cardiovascular outcomes for cancer survivors.

To facilitate clinical translation, we developed an interactive web-based calculator using the Shiny framework, addressing the gap in precision cardiovascular risk management tools for BC-SPC patients. The tool integrates multidimensional clinical data to generate individualized CSS probabilities in real time. It provides clinicians with dynamic risk prediction support, for example by guiding intensified cardiac monitoring for high-risk patients, and supports the timely development of personalized treatment plans and optimized follow-up strategies.

Although this study provides an effective ML model and a clinical tool for predicting CSS in BC-SPC patients, several limitations should be acknowledged. First, the study is constrained by the retrospective nature of the SEER database and its limited variable coverage. The absence of granular clinical information, such as detailed treatment regimens, cardiovascular comorbidities, and lifestyle factors, may introduce residual confounding. In addition, treatment-related variables are susceptible to confounding by indication and “healthy patient” selection bias; thus, their apparent effects in the model may reflect an aggregate of baseline health status and disease severity rather than the causal impact of treatment itself. Moreover, despite applying strict classification rules, a small degree of misclassification between second primary cancers and recurrences may still exist, which is an inherent limitation of registry-based studies. Second, regarding statistical methodology, this study did not employ a competing-risks model. In subgroups with a high incidence of competing events, this may overestimate the cumulative incidence of cardiovascular death; therefore, our risk predictions should be interpreted as conditional risk stratification within specific survival contexts. Additionally, although SHAP analysis effectively quantifies feature contributions to model predictions, its interpretability remains constrained by the underlying model architecture and potential confounders, so it cannot be used to establish direct causal relationships; larger multicenter datasets will also be required to support reliable subgroup-specific interpretability analyses. Third, as the model was developed and validated exclusively using the SEER database, it currently lacks external evaluation. To confirm its utility and generalizability, further validation through multicenter external studies is required across broader geographic regions, diverse healthcare systems, and specific patient subgroups. Future research should also focus on refining the model by incorporating cohorts that provide comprehensive longitudinal treatment data and granular clinical information.
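The overestimation mentioned under the second limitation can be illustrated on a small hypothetical cohort. The sketch below compares a naive estimate (1 minus Kaplan-Meier, with deaths from other causes treated as censored) with the Aalen-Johansen cumulative incidence, which accounts for competing deaths. The data, event coding, and function names are illustrative assumptions and do not reproduce the study's analysis.

```python
def one_minus_km(times, events, cause=1):
    """Naive estimate: 1 - Kaplan-Meier, treating all other events as censoring."""
    surv = 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        d_cause = sum(1 for x, e in zip(times, events) if x == t and e == cause)
        surv *= 1.0 - d_cause / at_risk
    return 1.0 - surv


def aalen_johansen_cif(times, events, cause=1):
    """Cumulative incidence of `cause`; any nonzero code counts as an event."""
    event_free = 1.0  # probability of no event of any type before time t
    cif = 0.0
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        d_cause = sum(1 for x, e in zip(times, events) if x == t and e == cause)
        d_any = sum(1 for x, e in zip(times, events) if x == t and e != 0)
        cif += event_free * d_cause / at_risk
        event_free *= 1.0 - d_any / at_risk
    return cif


# Hypothetical follow-up in years; 0 = censored, 1 = cardiovascular death,
# 2 = death from another cause (the competing event).
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
events = [2, 1, 2, 0, 1, 2, 2, 1, 0, 0]

naive = one_minus_km(times, events)
cif = aalen_johansen_cif(times, events)
print(f"1 - KM: {naive:.3f}, Aalen-Johansen CIF: {cif:.3f}")
```

In this toy cohort the naive estimate exceeds the competing-risks estimate, because patients who died of other causes are implicitly treated as still able to die of cardiovascular disease.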

Conclusion

This study confirms the predictive value of ML for CSS in BC-SPC patients, with the RSF model demonstrating optimal predictive performance. An interactive web-based dynamic tool developed with key predictors enables real-time risk prediction, providing clinicians with intuitive and reliable technical support to optimize cardiovascular outcomes in BC-SPC survivors.

Supplemental Material

sj-pdf-1-dhj-10.1177_20552076261435833 - Supplemental material for Interpretable machine learning for predicting cardiovascular-specific survival in breast cancer patients with second primary cancers: A SEER-based study

Supplemental material, sj-pdf-1-dhj-10.1177_20552076261435833 for Interpretable machine learning for predicting cardiovascular-specific survival in breast cancer patients with second primary cancers: A SEER-based study by Wen Shui, Chao Lan, Xueqing Xing, Jian Wang and Huiping Liu in DIGITAL HEALTH

Supplemental Material

sj-docx-2-dhj-10.1177_20552076261435833 - Supplemental material for Interpretable machine learning for predicting cardiovascular-specific survival in breast cancer patients with second primary cancers: A SEER-based study

Supplemental material, sj-docx-2-dhj-10.1177_20552076261435833 for Interpretable machine learning for predicting cardiovascular-specific survival in breast cancer patients with second primary cancers: A SEER-based study by Wen Shui, Chao Lan, Xueqing Xing, Jian Wang and Huiping Liu in DIGITAL HEALTH

Footnotes

Acknowledgements

The authors thank the SEER database for providing data support.

ORCID iD

Huiping Liu

Ethical considerations

This study utilized data from the publicly available SEER database, where patient information is anonymized; therefore, ethics approval and informed consent were not required.

Author contributions

Wen Shui and Chao Lan: Conceptualization, Funding acquisition, Data curation, Formal analysis, Software, Visualization, Writing - original draft. Xueqing Xing: Formal analysis, Investigation, Supervision, Validation. Jian Wang: Methodology, Resources, Supervision, Writing - review & editing. Huiping Liu: Conceptualization, Methodology, Supervision, Writing - review & editing.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Natural Science Foundation of Shanxi Province (grant number 202203021222387).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data and code availability

Selected non-proprietary code components and scripts necessary for the primary analyses can be shared upon reasonable request to the corresponding author via email.

Data availability statement

The data utilized in this study were sourced from the SEER database.

Supplemental material

Supplemental material for this article is available online.

References

1. Allen, Hassan, Sofianopoulou, et al. Risks of second non-breast primaries following breast cancer in women: a systematic review and meta-analysis. Breast Cancer Res 2023; 25: 18.

2. Strongman, Gadd, Matthews, et al. Does cardiovascular mortality overtake cancer mortality during cancer survivorship?: An English retrospective cohort study. JACC CardioOncol 2022; 4: 113–123.

3. Ramin, Veiga LHS, et al. Risk of second primary cancer among women in the Kaiser Permanente breast cancer survivors cohort. Breast Cancer Res 2023; 25: 50.

4. Allen, Hassan, Walburga, et al. Second primary cancer risks after breast cancer in BRCA1 and BRCA2 pathogenic variant carriers. J Clin Oncol 2025; 43: 651–661.

5. Liang, Qin, et al. Risk of second primary cancer in young breast cancer survivors: an important yet overlooked issue. Ther Adv Med Oncol 2025; 17: 17588359251321904.

6. Cronin, Lowery, Kerin, et al. Risk prediction, diagnosis and management of a breast cancer patient with treatment-related cardiovascular toxicity: an essential overview. Cancers (Basel) 2024; 16: 1845.

7. Golani, Kagenaar, Jégu, et al. Socio-economic inequalities in second primary cancer incidence: a competing risks analysis of women with breast cancer in England between 2000 and 2018. Int J Cancer 2025; 156: 2283–2293.

8. Mehta, Watson, Barac, et al. Cardiovascular disease and breast cancer: where these entities intersect: a scientific statement from the American Heart Association. Circulation 2018; 137: e30–e66.

9. Deng, Jones, Wang, et al. Mortality after second malignancy in breast cancer survivors compared to a first primary cancer: a nationwide longitudinal cohort study. NPJ Breast Cancer 2022; 8: 82.

10. Peddi, Fasching, Liu, et al. Genetic polymorphisms and correlation with treatment-induced cardiotoxicity and prognosis in patients with breast cancer. Clin Cancer Res 2022; 28: 1854–1862.

11. Swanson, Zhang, et al. From patterns to patients: advances in clinical machine learning for cancer diagnosis, prognosis, and treatment. Cell 2023; 186: 1772–1791.

12. Yao, Zhou, Jia, et al. Machine learning prediction of pathological complete response to neoadjuvant chemotherapy with peritumoral breast tumor ultrasound radiomics: compare with intratumoral radiomics and clinicopathologic predictors. Breast Cancer Res Treat 2025; 212: 325–336.

13. Zhang, Duan, et al. Survival prediction in second primary breast cancer patients with machine learning: an analysis of SEER database. Comput Methods Programs Biomed 2024; 254: 108310.

14. Friedman, Negoita. History of the surveillance, epidemiology, and end results (SEER) program. J Natl Cancer Inst Monogr 2024; 2024: 105–109.

15. Krittanawong, Virk HUH, Bangalore, et al. Machine learning prediction in cardiovascular diseases: a meta-analysis. Sci Rep 2020; 10: 16057.

16. Al-Droubi, Jahangir, Kochendorfer, et al. Artificial intelligence modelling to assess the risk of cardiovascular disease in oncology patients. Eur Heart J Digit Health 2023; 4: 302–315.

17. Zhao, Nguyen, et al. Random survival forest for predicting the combined effects of multiple physiological risk factors on all-cause mortality. Sci Rep 2024; 14: 15566.

18. Zeng, Zhang, et al. Mortality prediction and influencing factors for intensive care unit patients with acute tubular necrosis: random survival forest and cox regression analysis. Front Pharmacol 2024; 15: 1361923.

19. Yang, et al. The development of a prediction model based on random survival forest for the prognosis of non-Hodgkin lymphoma: a prospective cohort study in China. Heliyon 2024; 10: e32788.

20. Sun, et al. Which model is better in predicting the survival of laryngeal squamous cell carcinoma?: comparison of the random survival forest based on machine learning algorithms to cox regression: analyses based on SEER database. Medicine (Baltimore) 2023; 102: e33144.

21. Ellmann, von Rohr, Komina, et al. Tumor grade-titude: XGBoost radiomics paves the way for RCC classification. Eur J Radiol 2025; 188: 112146.

22. Liang, Wang, Zhong, et al. Perspective: global burden of iodine deficiency: insights and projections to 2050 using XGBoost and SHAP. Adv Nutr 2025; 16: 100384.

23. Huang, Feng, et al. Deep-learning model for predicting the survival of rectal adenocarcinoma patients based on a surveillance, epidemiology, and end results analysis. BMC Cancer 2022; 22: 10.

24. Dong, Zhang, Duan, et al. Development of a machine learning-based model to predict prognosis of alpha-fetoprotein-positive hepatocellular carcinoma. J Transl Med 2024; 22: 55.

25. Lundberg, Erion, Chen, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2020; 2: 56–67.

26. Wang, Fan, Yang, et al. Higher risk of cardiovascular mortality than cancer mortality among long-term cancer survivors. Front Cardiovasc Med 2023; 10: 1014400.

27. North, Sinclair. The intersection between aging and cardiovascular disease. Circ Res 2012; 110: 1097–1108.

28. Weberpals, Jansen, Müller, et al. Long-term heart-specific mortality among 347 476 breast cancer patients treated with radiotherapy or chemotherapy: a registry-based cohort study. Eur Heart J 2018; 39: 3896–3903.

29. Maayah, Takahara, Alam, et al. Breast cancer diagnosis is associated with relative left ventricular hypertrophy and elevated endothelin-1 signaling. BMC Cancer 2020; 20: 51.

30. Maayah, Ferdaoussi, Boukouris, et al. Endothelin receptor blocker reverses breast cancer-induced cardiac remodeling. JACC CardioOncol 2023; 5: 686–700.

31. Kim, Park, Youn, et al. Development and validation of a risk score model for predicting the cardiovascular outcomes after breast cancer therapy: the CHEMO-RADIAT score. J Am Heart Assoc 2021; 10: e021931.

32. Ell, Martin, Cehic, et al. Cardiotoxicity of radiation therapy: mechanisms, management, and mitigation. Curr Treat Options Oncol 2021; 22: 70.

33. Falco, Masojć, Macała, et al. Deep inspiration breath hold reduces the mean heart dose in left breast cancer radiotherapy. Radiol Oncol 2021; 55: 212–220.

34. Loap, Fourquet, Kirova. Survival and toxicity after breast-conserving surgery and external beam reirradiation for localized ipsilateral breast tumour recurrence: a population-based study. Cancer Radiother 2024; 28: 265–271.

35. Stern, Kim, Smith Montes, et al. Breast-conserving therapy preserves sexual well-being more than postmastectomy breast reconstruction: trends, factors, and interventions. Plast Reconstr Surg 2025; 155: 407–420.

36. Pannu, Constantinou. Inflammation, nutrition, and clinical outcomes in breast cancer survivors: a narrative review. Curr Nutr Rep 2023; 12: 643–661.

37. Liu, Zheng, Cai, et al. Cardiotoxicity from neoadjuvant targeted treatment for breast cancer prior to surgery. Front Cardiovasc Med 2023; 10: 1078135.

38. Huang, Ding, et al. A novel nomogram for predicting long-term heart-disease specific survival among older female primary breast cancer patients that underwent chemotherapy: a real-world data retrospective cohort study. Front Public Health 2022; 10: 964609.

39. Shrout, Renna, Leonard, et al. Couples in breast cancer survivorship: daily associations in relationship satisfaction, stress, and health. Compr Psychoneuroendocrinol 2024; 20: 100261.

40. Stabellini, Cullen, Moore, et al. Social determinants of health data improve the prediction of cardiac outcomes in females with breast cancer. Cancers (Basel) 2023; 15. DOI: 10.3390/cancers15184630.