Development and validation of a machine learning model for predicting invasive breast cancer using 26 routine clinical examination indicators

Abstract

Background

Invasive breast cancer (IBC) is the most prevalent malignant tumor in women globally and a leading cause of female mortality, with increasing incidence and death rates. Recent advancements in machine learning (ML) have shown significant potential in IBC prediction. This study aimed to assess different ML strategies to develop an optimal model for predicting IBC based on routine clinical examination indicators.

Methods

We collected routine blood parameters, serum tumor marker indicators, and age data from 1,175 female patients (IBC and non-IBC) at the Affiliated Dazu’s Hospital of Chongqing Medical University. From these data, we identified 26 key routine clinical examination indicators: 23 routine blood parameters, 2 tumor marker indicators, and age. We constructed IBC prediction models using 10 ML algorithms. Model performance was evaluated on the test set and the internal validation set using accuracy, positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, F1 score, and area under the curve (AUC). Finally, a web tool for predicting IBC was developed from the best-performing model.

Results

We assessed ten ML models. The XGBoost model emerged as the optimal choice, achieving an AUC exceeding 0.970 in both the test set and internal validation cohorts. Interpretability analysis using Shapley additive explanations (SHAP) showed that basophil measures, platelet distribution width (PDW), and age ranked among the most important features of the XGBoost model for IBC prediction, highlighting the value of incorporating routinely collected clinical data into IBC prediction models.

Conclusions

The ML-based web tool developed using 26 routine clinical examination indicators has shown considerable promise in predicting IBC. Among the models, XGBoost performed best and may serve as a predictive tool that supports clinical decision-making and improves the accuracy of IBC diagnosis.

Keywords

Invasive breast cancer, machine learning, routine blood parameters, serum tumor marker indicators, Shapley additive explanations

Introduction

Invasive breast cancer (IBC) poses a significant threat to women’s health worldwide. According to data from the International Agency for Research on Cancer (IARC) published in GLOBOCAN 2022, approximately 2.26 million new cases of IBC are diagnosed each year globally, making it the most common cancer among women and accounting for 23.8% of all female cancer cases.¹ While the incidence rates vary by region and year, the overall trend shows a rise in IBC cases worldwide.^2,3 China reports around 300,000 new diagnoses annually, with a concerning trend toward younger onset, making premature death a major burden of the disease.^4,5

IBC is the most prevalent subtype of breast cancer, and early detection is crucial for improving patient outcomes. Currently, screening and diagnosis of IBC primarily rely on traditional methods such as mammography, ultrasound, and biopsy.^6–8 However, these approaches have several limitations, including low diagnostic accuracy, delayed diagnosis, high costs, and long appointment wait times, as well as the risks associated with invasive procedures and radiation exposure. Moreover, in China’s rural and under-resourced healthcare settings, misdiagnoses and missed diagnoses are more common, leading many patients to miss optimal treatment opportunities and, in some cases, to experience irreversible complications.⁹ Improving early recognition of IBC and implementing effective high-risk screening in community hospitals and primary care centers has therefore become an urgent issue.

In recent years, artificial intelligence (AI) has made significant advances in early cancer diagnosis and treatment.^10–12 While most cancer prediction models rely on multimodal data from imaging and laboratory tests, there is growing evidence that routine clinical examination indicators have significant potential in early cancer prediction, particularly when integrated with machine learning (ML) techniques.^13,14 By combining routine clinical examination indicators with ML algorithms, it is possible to develop an early warning screening model for IBC. This approach can enhance the diagnostic efficiency and accuracy in primary care settings, significantly reducing the rate of missed diagnoses. The application of ML can enable more patients, especially those in resource-limited areas, to benefit from expert-level diagnostic knowledge, thereby improving early awareness of IBC in underdeveloped regions—a matter of considerable clinical significance.

A literature review revealed that Sukhadia et al.¹⁵ developed an ML-based model to predict the risk of distant recurrence in IBC using clinicopathological data before and after treatment. Cross-institutional validation demonstrated that the random forest model performed best, highlighting the crucial predictive value of imaging assessment of treatment response. Barkana et al.¹⁶ developed an innovative breast mapping and scanning model, utilizing a grey-level co-occurrence matrix (GLCM) to quantify the mammographic features of inflammatory breast cancer, laying an important foundation for ML-assisted diagnostic models. Ben Rabah et al.¹⁷ proposed a multimodal deep learning model that integrates mammographic images with clinical metadata to achieve non-invasive classification of IBC subtypes, providing an AI-driven approach for personalized IBC diagnosis and treatment. However, to our knowledge, no study has yet reported the development of ML-based high-risk screening models for IBC using routine clinical examination indicators. This study aims to construct and evaluate various ML-based models for predicting IBC from routine clinical examination indicators, ultimately identifying the most effective model to aid in the early identification of high-risk patients and optimize their treatment timelines.

Materials and methods

Data sources and study population

This study retrospectively analyzed 1,175 female patients with invasive breast diseases who first visited the Affiliated Dazu’s Hospital of Chongqing Medical University between January 1, 2018, and December 31, 2023. Routine clinical examination indicators from 131 IBC patients and 355 non-IBC patients who first visited between January 1, 2018, and December 31, 2019, were selected for the internal validation cohort. Indicators from 305 IBC patients and 384 non-IBC patients who first visited between January 1, 2020, and December 31, 2023, were selected for the model establishment cohort and test set cohort.
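The date-based cohort split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `assign_cohort` and the cohort labels are hypothetical, and only the date boundaries and the reported patient counts come from the text.

```python
from datetime import date

# Date boundaries taken from the text; names and labels are illustrative only.
VALIDATION_START, VALIDATION_END = date(2018, 1, 1), date(2019, 12, 31)
DEVELOPMENT_START, DEVELOPMENT_END = date(2020, 1, 1), date(2023, 12, 31)


def assign_cohort(first_visit: date) -> str:
    """Return a cohort label for a patient's first-visit date."""
    if VALIDATION_START <= first_visit <= VALIDATION_END:
        return "internal_validation"
    if DEVELOPMENT_START <= first_visit <= DEVELOPMENT_END:
        return "model_establishment_or_test"
    return "outside_study_window"


# Cohort sizes reported in the text, as (IBC, non-IBC) pairs.
reported = {
    "internal_validation": (131, 355),
    "model_establishment_or_test": (305, 384),
}
cohort_totals = {name: ibc + non_ibc for name, (ibc, non_ibc) in reported.items()}
overall_total = sum(cohort_totals.values())
print(cohort_totals, overall_total)
```

Summing the reported group sizes in this way is a quick consistency check against the overall study population of 1,175 patients.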

The model establishment cohort was used for feature selection, hyperparameter tuning, and model development, while the internal validation cohort and test set cohort evaluated the model’s performance. The inclusion criteria were as follows: 1) Patients confirmed to have IBC through pathological examination; 2) Complete routine clinical examination indicators; 3) Patients who had not received any treatment before the IBC diagnosis. Exclusion criteria were: 1) Patients with incomplete routine clinical examination indicators; 2) IBC patients with comorbidities. Non-IBC patients diagnosed during the same period were selected as the control group.

Data pre-processing

In this study, we collected 38 routine clinical examination indicators that are both cost-effective and widely available, including routine blood parameters, electrolytes, tumor markers, ferritin levels, and age. Indicators with more than 25% missing data in the entire dataset were excluded from model training, leaving a final selection of 26 routine clinical examination indicators for analysis. These included hemoglobin (Hb; g/L), hematocrit (Hct; %), mean corpuscular hemoglobin (MCH; pg), mean corpuscular hemoglobin concentration (MCHC; g/L), mean corpuscular volume (MCV; fL), mean platelet volume (MPV; fL), platelet large cell ratio (P-LCR; %), platelet distribution width (PDW; %), plateletcrit (PCT; %), platelet count (PLT; 10⁹/L), red blood cell count (RBC; 10¹²/L), white blood cell count (WBC; 10⁹/L), neutrophil percentage (Neut%; %), lymphocyte percentage (Lymph%; %), monocyte percentage (Mono%; %), eosinophil percentage (Eos%; %), basophil percentage (Baso%; %), absolute neutrophil count (Neut#; 10⁹/L), absolute lymphocyte count (Lymph#; 10⁹/L), absolute monocyte count (Mono#; 10⁹/L), absolute eosinophil count (Eos#; 10⁹/L), absolute basophil count (Baso#; 10⁹/L), red cell distribution width (RDW-CV; %), carbohydrate antigen 15-3 (CA15-3; U/mL), carcinoembryonic antigen (CEA; ng/mL), and age (years).
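The missing-data rule described above (drop any indicator with more than 25% missing values) can be illustrated with a small sketch. The helper `drop_sparse_indicators` and the toy values are hypothetical and are not study data; the example assumes missing values are represented as `None`.

```python
def drop_sparse_indicators(table, max_missing=0.25):
    """Keep only indicators whose share of missing values is at most max_missing.

    `table` maps an indicator name to a list of values; None marks a missing value.
    """
    kept = {}
    for name, values in table.items():
        missing_share = sum(v is None for v in values) / len(values)
        if missing_share <= max_missing:
            kept[name] = values
    return kept


# Toy values for illustration only (not study data).
toy = {
    "Hb": [120.0, 131.0, None, 118.0],     # 25% missing: kept (not more than 25%)
    "Ferritin": [None, None, 40.0, None],  # 75% missing: dropped
    "Age": [52, 47, 61, 39],               # 0% missing: kept
}
filtered = drop_sparse_indicators(toy)
print(sorted(filtered))
```

Because the rule excludes indicators with more than 25% missing data, an indicator with exactly 25% missing values is retained in this sketch.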

Statistical analysis

Throughout the model development process, we used Python 3.11 as the programming environment, incorporating libraries such as Scikit-learn 1.4.2 for ML, SHAP 0.45.1 for model interpretability, Matplotlib 3.8.2 for visualization, Pandas 2.2.2 for data handling, and NumPy 1.26.3 to ensure efficient and accurate completion of ML tasks.

We summarized the distribution of demographic characteristics and routine laboratory parameters in the internal validation cohort and in the model establishment and test set cohorts. For age, tumor markers, and routine blood parameters, we calculated the means, standard deviations (SD), medians, and interquartile ranges in the IBC and non-IBC groups, together with p-values for the between-group differences. Continuous variables were compared using analysis of variance (ANOVA), with p-values adjusted using the false discovery rate (FDR) method. A p-value less than 0.05 was considered to indicate a statistically significant difference between the IBC and non-IBC groups.
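The text does not state which FDR procedure was used. As an illustration only, the sketch below assumes the Benjamini-Hochberg procedure, which is the most common choice; the function name and the raw p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five indicators (not study results).
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(raw)
significant = [a < 0.05 for a in adj]
print(adj, significant)
```

In this toy example, four raw p-values fall below 0.05, but only two remain below 0.05 after adjustment, which shows why adjustment matters when 26 indicators are tested at once.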

Performance of different models

To develop an IBC prediction model, we employed ten different ML algorithms: support vector machine (SVM), multilayer perceptron (MLP), logistic regression (LR), K-nearest neighbors (KNN), decision tree (DT), random forest (RF), extreme gradient boosting (XGBoost), gradient boosting decision tree (GBDT), light gradient boosting machine (LightGBM), and adaptive boosting (AdaBoost). To enhance model performance, we used a random search method for hyperparameter tuning, selecting the area under the curve (AUC) as the primary evaluation metric.

After optimization, we evaluated the models using stratified 10-fold cross-validation to test their generalization ability on new data. Compared with standard k-fold cross-validation, stratified 10-fold cross-validation is particularly well suited to imbalanced datasets because each fold maintains the same class proportions as the overall dataset, which improves the consistency and stability of the evaluation. Additionally, we used the bootstrap method to calculate confidence intervals for the evaluation metrics.
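The two ideas in this paragraph, stratified fold assignment and bootstrap confidence intervals, can be sketched in plain Python. This is an illustrative simplification, not the authors' pipeline: the function names are hypothetical, the folds are built over the full 2020-2023 class counts (305 IBC, 384 non-IBC) rather than the actual training subset, the percentile bootstrap is an assumed variant, and the correctness flags are invented toy data.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, preserving class proportions."""
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Deal the shuffled members of each class round-robin across folds.
        for position, idx in enumerate(members):
            fold_of[idx] = position % k
    return fold_of


def bootstrap_ci(values, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(values)."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def mean(xs):
    return sum(xs) / len(xs)


# Labels built from the reported class counts (1 = IBC, 0 = non-IBC).
labels = [1] * 305 + [0] * 384
folds = stratified_folds(labels, k=10)
ibc_per_fold = [
    sum(1 for i, f in enumerate(folds) if f == j and labels[i] == 1)
    for j in range(10)
]
fold_sizes = [folds.count(j) for j in range(10)]

# Hypothetical per-sample correctness flags: 93 of 100 predictions correct.
correct = [1] * 93 + [0] * 7
low, high = bootstrap_ci(correct, mean)
print(ibc_per_fold, fold_sizes, (low, high))
```

Each fold receives 30 or 31 IBC cases, so the class ratio in every fold stays close to the overall ratio, which is the property the paragraph attributes to stratification.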

Finally, we applied the optimized parameters to train and test different data groups, constructing auxiliary diagnostic models, which were then evaluated based on their practical performance in IBC diagnosis. The model development process is illustrated in Figure S1.

Model validation

During the model validation phase, we used samples from the test set cohort and the internal validation cohort to further evaluate each model and to calculate its key performance indicators.

After optimizing and training various models, several key performance indicators were calculated, including accuracy, positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, recall, F1 score, and AUC. Accuracy represents the proportion of samples for which the model’s predictions match the true labels. PPV and NPV reflect the model’s reliability in predicting different classes. Sensitivity and specificity measure the model’s ability to correctly identify positive and negative samples, respectively. Recall, which is identical to sensitivity, assesses the model’s capability to correctly classify positive samples. The F1 score, the harmonic mean of precision and recall, provides a balanced evaluation of both metrics. Finally, the AUC is a comprehensive metric that evaluates the model’s overall performance at various thresholds. An AUC value closer to 1 indicates a stronger ability of the model to distinguish between positive and negative samples.
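The metric definitions above can be made concrete with a short sketch. The confusion-matrix counts and scores below are invented for illustration and are not study results; the function names are hypothetical. The AUC is computed from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # identical to recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                  # identical to precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "ppv": ppv,
        "npv": npv,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def auc_from_scores(y_true, scores):
    """AUC as the share of positive-negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented counts and scores for illustration only.
metrics = confusion_metrics(tp=45, fp=5, tn=40, fn=10)
auc = auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(metrics, auc)
```

With these toy counts, PPV is 45/50 = 0.9, NPV is 40/50 = 0.8, and F1 equals 2TP/(2TP + FP + FN) = 90/105, which matches the harmonic-mean definition given in the text.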

By comparing and analyzing these evaluation metrics, we can gain a deeper understanding of the strengths and weaknesses of each model, allowing us to select the most appropriate model for this IBC dataset (Figure S2).

System development

To improve clinicians’ accuracy in the early screening of IBC, we developed a web-based tool that integrates routine blood parameters, tumor marker indicators, age, and the optimal ML model. The system’s homepage displays daily statistics on hospital visits and patient data (Figure S3). Doctors can log into the system to view all test data for each patient (Figure S4). By inputting a patient’s personal information, doctors can generate an IBC diagnosis report to use as a reference (Figures S5-S6). Clinicians can then decide, based on their clinical experience, whether to adopt the system’s predicted diagnosis and confirm their decision within the system.

Results

Patient characteristics and variables

The internal validation cohort included a total of 486 patients, comprising 131 IBC patients and 355 non-IBC patients. The model establishment and test set cohorts included a total of 689 patients, comprising 305 IBC patients and 384 non-IBC patients. In the internal validation cohort, several variables demonstrated highly significant differences between the IBC and non-IBC groups, with p-values less than 0.001. These variables included age, Baso#, Baso%, Hb, Hct, Lymph#, Lymph%, MPV, Mono%, P-LCR, and PCT. Additionally, CA15-3, CEA, Neut%, and PLT had p-values below 0.05, indicating statistically significant differences. In contrast, Eos#, Eos%, MCH, MCHC, MCV, Neut#, RDW-CV, and WBC did not show statistically significant differences (Table S1). In the model establishment and test set cohorts, age, CA15-3, CEA, Baso#, Baso%, Hb, Hct, Lymph#, Lymph%, MPV, Mono#, and Mono% all exhibited p-values below 0.001, indicating clear differences between IBC and non-IBC cases. MCV and Neut# also showed significant differences, with p-values less than 0.05. However, Eos#, Eos%, MCH, MCHC, and WBC did not exhibit statistically significant differences (Table S2).

Evaluation of the predictive performance of different models for IBC group and non-IBC group in the test set cohort

As shown in Table S3, validation using the test set cohort revealed that all ten ML algorithms achieved AUC values greater than 0.920, indicating high classification performance. The XGBoost model achieved the highest AUC, 0.975 (95% CI: 0.945-1.000) (Figure 1(a)). Among the other six evaluation metrics, the XGBoost model also achieved the highest values for accuracy and F1 score, at 0.935 (95% CI: 0.892-0.970) and 0.934 (95% CI: 0.891-0.964), respectively. Its NPV and sensitivity were also high, at 0.967 (95% CI: 0.945-0.990) and 0.961 (95% CI: 0.928-0.987), respectively (Figure 2(a)). Overall, the XGBoost model scored above 0.910 on all evaluation metrics, demonstrating superior performance compared with the other models. Therefore, the XGBoost model was selected as the optimal choice for IBC prediction.

Figure 1.

ROC curves of 10 ML models. (a) The AUC value results in the test set cohort. (b) The AUC value results in the internal validation cohort. The horizontal axis is the False Positive Rate (FPR), the vertical axis is the True Positive Rate (TPR), and the area under each curve is the AUC value of the model, which is used to measure the overall performance of the model.

Figure 2.

Results of six evaluation metrics across 10 ML models. The figure shows the model’s performance on the test set (a) and internal validation (b). (a) Line chart. Different colors represent different models; the horizontal axis represents the evaluation metric, and the vertical axis represents the value of the evaluation metric. (b) Bar chart. Different colors represent different evaluation metrics; the horizontal axis represents multiple models, and the vertical axis represents the evaluation value.

Evaluation of the predictive performance of different models for IBC group and non-IBC group in internal validation cohort

As shown in Table S4, we further evaluated the performance of our models using the internal validation cohort. Comparing seven evaluation metrics across the ten ML models, we found that the XGBoost model performed well on all of them. The ROC curves of the ten models are shown in Figure 1(b). All ten models achieved an AUC value exceeding 0.920, indicating excellent classification performance. Notably, the XGBoost model had the highest AUC value, reaching 0.982 (95% CI: 0.970-0.992). This model also outperformed the other models on four key metrics: accuracy, NPV, sensitivity, and F1 score, with values of 0.947 (95% CI: 0.922-0.967), 0.960 (95% CI: 0.938-0.977), 0.885 (95% CI: 0.823-0.938), and 0.898 (95% CI: 0.855-0.934), respectively (Figure 2(b)).

Analysis of model interpretability

Shapley additive explanations (SHAP) is a powerful tool for interpreting ML models and assessing the importance of each feature in relation to model predictions. According to Figure 3, the top 10 features of the XGBoost model, ranked by importance, are Baso%, Baso#, PDW, age, Mono%, CA15-3, PCT, Lymph#, P-LCR, and MPV. The influence of these features remains generally consistent across the various models. Notably, the XGBoost model is mainly affected by the basophil indices (Baso% and Baso#) from routine blood testing, which may be important for the early detection of IBC.
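To show what a SHAP value measures, the sketch below computes exact Shapley values by brute force for a tiny hypothetical scoring function of three inputs. It is not the study's XGBoost model, and the SHAP library uses far more efficient algorithms for tree models; the function names, coefficients, sample, and baseline are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one sample; absent features take baseline values."""
    n = len(x)
    features = range(n)

    def value(subset):
        # Features in `subset` use the sample's value; the rest use the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in features]
        return model(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(v):
    """Hypothetical score with one interaction term (not the study's model)."""
    baso_pct, pdw, age = v
    return 2.0 * baso_pct + 0.5 * pdw + 0.01 * age + 0.1 * baso_pct * pdw


x = [1.0, 12.0, 60.0]          # hypothetical patient: Baso%, PDW, age
baseline = [0.5, 10.0, 50.0]   # hypothetical reference values
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the score for the sample and the score for the baseline, which is the additivity property that lets SHAP plots such as Figure 3 attribute a prediction to individual indicators.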

Figure 3.

SHAP value plot for XGBoost, the top-performing ML model. The plot shows the SHAP values of the top ten routine clinical examination indicators for distinguishing the IBC and non-IBC groups, reflecting their importance to the model's predictions.

Discussion

Principal results

IBC remains a significant threat to the physical and mental health of women, particularly in China, where its incidence continues to rise.^18–20 Early diagnosis of IBC is crucial for improving patient prognosis and alleviating the burden of the disease. This study successfully developed and validated a new method for predicting IBC using an ML model based on 26 routine clinical examination indicators, while also exploring its clinical application value.

The core finding of this study is that the XGBoost model exhibited very high discriminative performance (AUC > 0.970) in both the test set and internal validation cohorts, indicating that it can reliably distinguish between IBC and non-IBC patients. This performance appears to exceed that of traditional clinical diagnostic methods and surpasses that of the nine other ML models compared, highlighting its modeling advantages for complex medical data. XGBoost’s superior performance is likely due to two factors. First, its gradient boosting tree algorithm automatically identifies key feature combinations and nonlinear relationships that traditional statistical models struggle to capture. Second, regularization strategies and cross-validation reduce the risk of overfitting, supporting the model’s stability on unseen data. Notably, the model’s high sensitivity and NPV suggest that a negative prediction can reliably rule out the disease, thereby reducing unnecessary follow-up examinations and alleviating patient anxiety.^21–23

Additionally, feature importance analysis using SHAP values revealed that routine clinical examination indicators contributed substantially to the model’s predictions.^24–26 Among these, Baso% emerged as the most influential feature; inflammatory responses in the body may be directly linked to abnormalities in routine blood inflammatory indicators such as this one, underscoring its potential application in clinical practice. Our research also identified associations of Baso%, Baso#, PDW, P-LCR, and MPV with IBC, aligning with findings from existing studies on IBC diagnosis and prognosis.^27–30 It is worth noting that PDW, an important platelet activation parameter, is not only significantly associated with poor prognosis of IBC; its decreased level is also associated with histological subtype, multifocal lesions, and lymph node metastasis status. Multivariate analysis further confirmed that PDW is an independent predictor of bone metastasis.³¹

In addition, the monocyte-lymphocyte ratio (MLR), an inflammatory marker detectable in peripheral blood, reflects the dynamic balance between pro-tumor monocytes and anti-tumor lymphocytes and has shown predictive value for IBC treatment response in multiple studies.^32,33 In terms of imaging assessment, ultrasound examination demonstrates high sensitivity for detecting axillary lymph node metastasis but low specificity.³⁴ Combining platelet parameters such as MPV with immune-related indicators such as lymphocyte counts can further improve the assessment of patients with IBC.³⁵ Medical literature^36,37 also emphasizes the close correlation between age and the malignant progression and prognosis of IBC. Incorporating the typical tumor marker CA15-3³⁸ into clinical decision-making can lead to more effective diagnosis and treatment of IBC. These findings emphasize the importance of incorporating routinely collected clinical data into predictive models, as such data can offer valuable information on patient health without requiring costly and invasive tests.

Meanwhile, similar studies³⁹ have reported the use of CDP nanobiosensors, immunohistochemical data, and ML algorithms to diagnose IBC. However, these studies are limited in data scale and feature count and lack model interpretation, so they do not account for the contribution of clinical indicators to IBC prediction. By contrast, a strength of this study lies in the accessibility and cost-effectiveness of routine laboratory test data. Compared with current IBC screening technologies, this method can shorten appointment wait times and reduce diagnostic delays, and it is not constrained by the frequency of screenings or the age at which screening begins.⁴⁰

Limitations and future work

Despite these advantages, several limitations must be acknowledged. The retrospective nature of data collection may introduce bias, and while the sample size is substantial, a larger sample would strengthen the validation of our findings.^41,42 Additionally, although the model has demonstrated excellent performance in both the test set and internal validation cohorts, its applicability to a broader population warrants further investigation. External validation in independent, multi-institutional cohorts is also needed to assess the generalizability and clinical potential of the findings.^43,44 Future research should focus on including diverse patient populations and on integrating conventional clinical indicators with novel biosensor data to enhance model interpretability and construct more comprehensive and reliable IBC prediction systems.

Conclusion

In summary, this study provides a promising alternative method for the early screening of IBC. An ML model was developed using 26 routine clinical examination indicators, which was then used to build a web-based tool for clinical application. With ongoing technological advancements, we believe these methods hold the promise of achieving even greater breakthroughs in the early screening of IBC.

Supplemental material

Supplemental material for Development and validation of a machine learning model for predicting invasive breast cancer using 26 routine clinical examination indicators by Lijuan Pan, Wenjing Deng, Ziwei Zhao, Yulong Liu, Xuelian Peng, Chunyan Yang, Baoru Han, Shan Shi and Jin Li in Digital Health.

Footnotes

Acknowledgements

This clinical research project has been approved by the Affiliated Dazu’s Hospital of Chongqing Medical University. We extend our sincere gratitude to all participants involved in this study.

Ethical considerations

This study was carried out according to the protocol which was reviewed and approved by the Medical Ethics Committee of The Affiliated Dazu’s Hospital of Chongqing Medical University (Approval No. DZ2024-04-039). The Ethics Committee approved this study protocol and waived the obligation for informed consent because of the retrospective nature of the study.

Author contributions

L.P., W.D., and Y.L. collected the case and experimental data and drafted the main manuscript. W.D. analysed the experimental data, L.P. performed data analysis and interpretation, and B.H. and J.L. provided major funding for the study. B.H., S.S., and J.L. provided major revisions to the manuscript. All authors have read and approved the final manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was financially supported by grants from Chongqing Natural Science Foundation General Project (No. CSTB2024NSCO-MSX0439), Chongqing Medical Scientific Research Project (Joint Project of Chongqing Health Commission and Science and Technology Bureau) (No. 2024MSXM045), the Major Joint Science and Health Project of DaZu District (No. DZKJ2024JSYJ-KWXM1001), and the Intelligent Medical Project of Chongqing Medical University (No. ZHYX202206).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Appendix

References

Bray

Laversanne

Sung

, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2024; 74(3): 229–263. https://doi.org/10.3322/caac.21834

Lei

Zheng

Zhang

, et al. Global patterns of breast cancer incidence and mortality: A population-based cancer registry data analysis from 2000 to 2020. Cancer Commun (Lond) 2021; 41(11): 1183–1194. https://doi.org/10.1002/cac2.12207

Giaquinto

Sung

Miller

, et al. Breast Cancer Statistics, 2022. CA Cancer J Clin 2022; 72(6): 524–541. https://doi.org/10.3322/caac.21754

Fan

Zhang

, et al. Burden of Disease Due to Cancer - China, 2000-2019. China CDC Wkly 2022; 4(15): 306–311. https://doi.org/10.46234/ccdcw2022.036

Fan

Strasser-Weippl

, et al. Breast cancer in China. Lancet Oncol 2014; 15(7): e279–e289. https://doi.org/10.1016/S1470-2045(13)70567-9

Black

Mittendorf

. et al. Landmark trials affecting the surgical management of invasive breast cancer. Surg Clin North Am 2013; 93(2): 501–518. https://doi.org/10.1016/j.suc.2012.12.007

Jagsi

. Progress and controversies: radiation therapy for invasive breast cancer. CA Cancer J Clin 2014; 64(2): 135–152. https://doi.org/10.3322/caac.21209

Tagliafico

Piana

Schenone

, et al. Overview of radiomics in breast cancer diagnosis and prognostication. Breast 2020; 49: 74–80. https://doi.org/10.1016/j.breast.2019.10.018

Chen

Bai

, et al. Deep Learning-Based Computer-Aided Diagnosis for Breast Lesion Classification on Ultrasound: A Prospective Multicenter Study of Radiologists Without Breast Ultrasound Expertise. AJR Am J Roentgenol 2023; 221(4): 450–459. https://doi.org/10.2214/AJR.23.29328

10.

Mitsala

Tsalikidis

Pitiakoudis

, et al. Artificial Intelligence in Colorectal Cancer Screening, Diagnosis and Treatment. A New Era. Curr Oncol 2021; 28(3): 1581–1607. https://doi.org/10.3390/curroncol28030149

11.

Chen

Lin

, et al. Artificial intelligence for assisting cancer diagnosis and treatment in the era of precision medicine. Cancer Commun (Lond) 2021; 41(11): 1100–1115. https://doi.org/10.1002/cac2.12215

12.

Yan

. et al. Artificial intelligence in breast cancer: application and future perspectives. J Cancer Res Clin Oncol 2023; 149(17): 16179–16190. https://doi.org/10.1007/s00432-023-05337-2

13.

Zan

Gao

, et al. A Machine Learning Method for Identifying Lung Cancer Based on Routine Blood Indices: Qualitative Feasibility Study. JMIR Med Inform 2019; 7(3): e13476. https://doi.org/10.2196/13476

14.

Gould

Huang

Tammemagi

, et al. Machine Learning for Early Lung Cancer Identification Using Routine Clinical and Laboratory Data. Am J Respir Crit Care Med 2021; 204(4): 445–453. https://doi.org/10.1164/rccm.202007-2791OC

15.

, et al. Machine Learning-Based Prediction of Distant Recurrence in Invasive Breast Carcinoma Using Clinicopathological Data: A Cross-Institutional Study. Cancers 2023; 15(15).

16.

Barkana

Ahmad

Essodegui

, et al. Characterization of mammographic markers of inflammatory breast cancer (IBC). Physica medica: PM: an international journal devoted to the applications of physics to medicine and biology: official journal of the Italian Association of Biomedical Physics (AIFB) 2024; 129: 104870. https://doi.org/10.1016/j.ejmp.2024.104870

17.

Rabah

Sattar

Ibrahim

, et al. A Multimodal Deep Learning Model for the Classification of Breast Cancer Subtypes. Diagnostics (Basel, Switzerland) 2025; 15(8): 995. https://doi.org/10.3390/diagnostics15080995

18.

Lei

Zheng

Zhang

, et al. Breast cancer incidence and mortality in women in China: temporal trends and projections to 2030. Cancer Biol Med 2021; 18(3): 900–909. https://doi.org/10.20892/j.issn.2095-3941.2020.0523

19.

Yan

Ren

Jia

, et al. Breast cancer risk factors and mammographic density among 12518 average-risk women in rural China. BMC Cancer 2023; 23(1): 952. https://doi.org/10.1186/s12885-023-11444-7

20.

Sun

Zhang

Lei

, et al. Incidence, mortality, and disability-adjusted life years of female breast cancer in China, 2022. Chin Med J (Engl) 2024; 137(20): 2429–2436. https://doi.org/10.1097/CM9.0000000000003278

21.

Zhang

, et al. High Diagnostic Accuracy of Epigenetic Imprinting Biomarkers in Thyroid Nodules. J Clin Oncol 2023; 41(6): 1296–1306. https://doi.org/10.1200/JCO.22.00232

22.

Jannusch

Dietzel

Bruckmann

, et al. Prediction of therapy response of breast cancer patients with machine learning based on clinical data and imaging data derived from breast [(18)F] FDG-PET/MRI. Eur J Nucl Med Mol Imaging 2024; 51(5): 1451–1461. https://doi.org/10.1007/s00259-023-06513-9

23.

Mango

Olasehinde

Omisore

, et al. The iBreastExam versus clinical breast examination for breast evaluation in high risk and symptomatic Nigerian women: a prospective study. Lancet Glob Health 2022; 10(4): e555–555e563. https://doi.org/10.1016/S2214-109X(22)00030-4

24.

Yang

Chen

, et al. XGBoost-SHAP-based interpretable diagnostic framework for alzheimer's disease. BMC Med Inform Decis Mak 2023; 23(1): 137. https://doi.org/10.1186/s12911-023-02238-9

25.

Wang

Tian

Zheng

, et al. Interpretable prediction of 3-year all-cause mortality in patients with heart failure caused by coronary heart disease based on machine learning and SHAP. Comput Biol Med 2021; 137: 104813. https://doi.org/10.1016/j.compbiomed.2021.104813

26. Wojtuch, Jankowski, Podlewska, et al. How can SHAP values help to shape metabolic stability of chemical compounds. J Cheminform 2021; 13(1): 74. https://doi.org/10.1186/s13321-021-00542-y

27. Graziano, Grassadonia, Iezzi, et al. Combination of peripheral neutrophil-to-lymphocyte ratio and platelet-to-lymphocyte ratio is predictive of pathological complete response after neoadjuvant chemotherapy in breast cancer patients. Breast 2019; 44: 33–38. https://doi.org/10.1016/j.breast.2018.12.014

28. Wei, Yao, Xing, et al. The neutrophil lymphocyte ratio is associated with breast cancer prognosis: an updated systematic review and meta-analysis. Onco Targets Ther 2016; 9: 5567–5575. https://doi.org/10.2147/OTT.S108419

29. Petrone, Gaulin, Derkach, et al. Routine clinical parameters and laboratory testing predict therapy-related myeloid neoplasms after treatment for breast cancer. Haematologica 2023; 108(1): 161–170. https://doi.org/10.3324/haematol.2021.280437

30. Divsalar, Heydari, Habibollah, et al. Hematological Parameters Changes in Patients with Breast Cancer. Clin Lab 2021; 67(8). https://doi.org/10.7754/Clin.Lab.2020.201103

31. Song, Zhao, Huang, et al. Preoperative platelet distribution width predicts bone metastasis in patients with breast cancer. BMC Cancer 2024; 24(1): 1066. https://doi.org/10.1186/s12885-024-12837-y

32. Obeagu. Monocyte-to-lymphocyte ratio as a predictive marker for breast cancer treatment outcomes: a narrative review. Annals of medicine and surgery (2012) 2025; 87(11): 7262–7266.

33. Obeagu. Monocyte-to-lymphocyte ratio as a subtype-specific biomarker in breast cancer prognosis: a narrative review. Annals of medicine and surgery (2012) 2025; 87(12): 8617–8623. https://doi.org/10.1097/MS9.0000000000004218

34. Laiq, Masood, Siddiqui, et al. Prediction of axillary lymph node metastasis in breast cancer patients based on ultrasonographic-clinicopathologic features. Pakistan journal of medical sciences 2025; 41(1): 196. 100. https://doi.org/10.12669/pjms.41.1.10384

35. X-H, Wang, et al. Preoperative mean platelet volume predicts survival in breast cancer patients with type 2 diabetes. Breast cancer (Tokyo, Japan) 2019; 26(6): 712–718. https://doi.org/10.1007/s12282-019-00976-1

36. Akben, Yumrutaş, et al. A simple and fast explainable artificial intelligence-based pre-screening tool for breast cancer tumor malignancy detection. Scientific Reports 2025; 15(1): 34347. https://doi.org/10.1038/s41598-025-16842-4

37. Mendes, Oliveira, Araújo, et al. You get the best of both worlds? Integrating deep learning and traditional machine learning for breast cancer risk prediction. Computers in biology and medicine 2025; 187: 109733. https://doi.org/10.1016/j.compbiomed.2025.109733

38. Naeimi, Harsini, Derbekyan, et al. Association of tumor markers CA 15-3, CEA, and CA 125 with [18F]NaF PET findings in breast cancer patients. Frontiers in oncology 2025; 15: 1673504. https://doi.org/10.3389/fonc.2025.1673504

39. Amraei, Mirzapoor, Motarjem, et al. Enhancing breast cancer diagnosis through machine learning algorithms. Scientific Reports 2025; 15(1): 23316. https://doi.org/10.1038/s41598-025-07628-9

40. Wei, Liang, et al. Cost-effective prognostic evaluation of breast cancer: using a STAR nomogram model based on routine blood tests. Front Endocrinol (Lausanne) 2024; 15: 1324617. https://doi.org/10.3389/fendo.2024.1324617

41. Swanson, Zhang, et al. From patterns to patients: Advances in clinical machine learning for cancer diagnosis, prognosis, and treatment. Cell 2023; 186(8): 1772–1791. https://doi.org/10.1016/j.cell.2023.01.035

42. Tran, Kondrashova, Bradley, et al. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Med 2021; 13(1): 152. https://doi.org/10.1186/s13073-021-00968-x

43. Guo, Zhang, Yuan, et al. Machine learning and new insights for breast cancer diagnosis. J Int Med Res 2024; 52(4): 3000605241237867. https://doi.org/10.1177/03000605241237867

44. Painuli, Bhardwaj, Köse, et al. Recent advancement in cancer diagnosis using machine learning and deep learning techniques: A comprehensive review. Comput Biol Med 2022; 146: 105580. https://doi.org/10.1016/j.compbiomed.2022.105580
