Abstract
Background
Invasive breast cancer (IBC) is the most prevalent malignant tumor in women globally and a leading cause of female mortality, with increasing incidence and death rates. Recent advancements in machine learning (ML) have shown significant potential in IBC prediction. This study aimed to assess different ML strategies to develop an optimal model for predicting IBC based on routine clinical examination indicators.
Methods
We collected routine blood parameters, serum tumor marker indicators, and age data from 1,175 female patients (436 with IBC and 739 without IBC) at the Affiliated Dazu’s Hospital of Chongqing Medical University. From these data, we selected 26 routine clinical examination indicators: 23 routine blood parameters, 2 tumor markers, and age. We constructed IBC prediction models using 10 ML algorithms. Model performance was evaluated on the test set and the internal validation cohort using accuracy, positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, F1 score, and area under the curve (AUC). Finally, a web tool for predicting IBC was developed based on the best-performing model.
Results
We assessed ten ML models in the test set and internal validation cohorts. The XGBoost model emerged as the optimal choice, achieving an AUC exceeding 0.970 in both cohorts. Interpretability analysis using Shapley additive explanations (SHAP) showed that basophil measures, platelet distribution width (PDW), and age ranked highly in the feature importance of the XGBoost model for IBC prediction, highlighting the value of incorporating routinely collected clinical data into IBC prediction models.
Conclusions
The ML-based web tool developed using 26 routine clinical examination indicators has shown considerable promise in predicting IBC. Among the models, the XGBoost algorithm exhibited the highest performance, becoming a reliable predictive tool that can enhance clinical decision-making and improve the accuracy of IBC diagnoses.
Keywords
Introduction
Invasive breast cancer (IBC) poses a significant threat to women’s health worldwide. According to data from the International Agency for Research on Cancer (IARC) published in GLOBOCAN 2022, approximately 2.26 million new cases of IBC are diagnosed each year globally, making it the most common cancer among women and accounting for 23.8% of all female cancer cases.1 While the incidence rates vary by region and year, the overall trend shows a rise in IBC cases worldwide.2,3 China reports around 300,000 new diagnoses annually, with a concerning trend toward younger onset, making premature death a major burden of the disease.4,5
IBC is the most prevalent subtype of breast cancer, and early detection is crucial for improving patient outcomes. Currently, screening and diagnosis of IBC rely primarily on traditional methods such as mammography, ultrasound, and biopsy.6–8 However, these approaches have several limitations, including low diagnostic accuracy, delayed diagnosis, high costs, and long appointment wait times, as well as the risks associated with invasive procedures and radiation exposure. Moreover, in China’s rural and under-resourced healthcare settings, misdiagnoses and missed diagnoses are more common, causing many patients to miss the optimal treatment window and, in some cases, to experience irreversible complications.9 Therefore, improving early recognition of IBC and implementing effective high-risk screening in community hospitals and primary care centers has become an urgent issue.
In recent years, artificial intelligence (AI) has made significant advances in early cancer diagnosis and treatment.10–12 While most cancer prediction models rely on multimodal data from imaging and laboratory tests, there is growing evidence that routine clinical examination indicators have significant potential in early cancer prediction, particularly when integrated with machine learning (ML) techniques.13,14 By combining routine clinical examination indicators with ML algorithms, it is possible to develop an early warning screening model for IBC. This approach can enhance the diagnostic efficiency and accuracy in primary care settings, significantly reducing the rate of missed diagnoses. The application of ML can enable more patients, especially those in resource-limited areas, to benefit from expert-level diagnostic knowledge, thereby improving early awareness of IBC in underdeveloped regions—a matter of considerable clinical significance.
A literature review revealed that Sukhadia et al.15 developed an ML-based model to predict the risk of distant recurrence in IBC using clinicopathological data before and after treatment. Cross-institutional validation demonstrated that the random forest model performed optimally, highlighting the crucial predictive value of imaging assessment of treatment response. Barkana et al.16 developed an innovative breast mapping and scanning model, utilizing a grey-level co-occurrence matrix (GLCM) to quantify the mammographic features of inflammatory breast cancer, laying an important foundation for ML-assisted diagnostic models. Ben Rabah et al.17 proposed a multimodal deep learning model that integrates mammographic images with clinical metadata to achieve non-invasive classification of IBC subtypes, providing an AI-driven approach for personalized IBC diagnosis and treatment. However, to our knowledge, no study has reported the development of ML-based high-risk screening models for IBC using routine clinical examination indicators. This study therefore aims to construct and evaluate various ML-based models for predicting IBC from routine clinical examination indicators, ultimately identifying the most effective model to aid in the early identification of high-risk patients and optimize their treatment timelines.
Materials and methods
Data sources and study population
This study retrospectively analyzed 1,175 female patients with invasive breast diseases who first visited the Affiliated Dazu’s Hospital of Chongqing Medical University between January 1, 2018, and December 31, 2023. Routine clinical examination indicators from 131 IBC patients and 355 non-IBC patients who first visited between January 1, 2018, and December 31, 2019, were selected for the internal validation cohort. Indicators from 305 IBC patients and 384 non-IBC patients who first visited between January 1, 2020, and December 31, 2023, were selected for the model establishment cohort and test set cohort.
The model establishment cohort was used for feature selection, hyperparameter tuning, and model development, while the internal validation cohort and test set cohort evaluated the model’s performance. The inclusion criteria were as follows: 1) Patients confirmed to have IBC through pathological examination; 2) Complete routine clinical examination indicators; 3) Patients who had not received any treatment before the IBC diagnosis. Exclusion criteria were: 1) Patients with incomplete routine clinical examination indicators; 2) IBC patients with comorbidities. Non-IBC patients diagnosed during the same period were selected as the control group.
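The date-based cohort assignment described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `assign_cohort` and the cohort labels are hypothetical, and the further split of the 2020 to 2023 patients into model establishment and test sets is not specified in the text, so it is left out.

```python
from datetime import date

# Date windows taken from the text; labels are illustrative names only.
INTERNAL_VALIDATION = (date(2018, 1, 1), date(2019, 12, 31))
ESTABLISHMENT_AND_TEST = (date(2020, 1, 1), date(2023, 12, 31))


def assign_cohort(first_visit):
    """Return the cohort label for a patient's first-visit date, or None if out of range."""
    if INTERNAL_VALIDATION[0] <= first_visit <= INTERNAL_VALIDATION[1]:
        return "internal_validation"
    if ESTABLISHMENT_AND_TEST[0] <= first_visit <= ESTABLISHMENT_AND_TEST[1]:
        return "model_establishment_or_test"
    return None


# Hypothetical first-visit dates for three patients.
examples = [date(2019, 12, 31), date(2020, 1, 1), date(2024, 2, 1)]
cohorts = [assign_cohort(d) for d in examples]
print(cohorts)
```

Splitting by calendar period rather than at random means the validation cohort comes from a different time window than the training data, which is a stricter check than a random hold-out from the same period.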
Data pre-processing
In this study, we collected 38 routine clinical examination indicators that are both cost-effective and widely available, including routine blood parameters, electrolytes, tumor markers, ferritin levels, and age. Indicators with more than 25% missing data in the entire dataset were excluded from model training, leaving a final selection of 26 routine clinical examination indicators for analysis. These included hemoglobin (Hb; g/L), hematocrit (Hct; %), mean corpuscular hemoglobin (MCH; pg), mean corpuscular hemoglobin concentration (MCHC; g/L), mean corpuscular volume (MCV; fL), mean platelet volume (MPV; fL), platelet large cell ratio (P-LCR; %), platelet distribution width (PDW; %), plateletcrit (PCT; %), platelet count (PLT; 10^9/L), red blood cell count (RBC; 10^12/L), white blood cell count (WBC; 10^9/L), neutrophil percentage (Neut%; %), lymphocyte percentage (Lymph%; %), monocyte percentage (Mono%; %), eosinophil percentage (Eos%; %), basophil percentage (Baso%; %), absolute neutrophil count (Neut#; 10^9/L), absolute lymphocyte count (Lymph#; 10^9/L), absolute monocyte count (Mono#; 10^9/L), absolute eosinophil count (Eos#; 10^9/L), absolute basophil count (Baso#; 10^9/L), red cell distribution width (RDW-CV; %), carbohydrate antigen 15-3 (CA15-3; U/mL), carcinoembryonic antigen (CEA; ng/mL), and age (years).
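The 25% missing-data rule can be sketched as follows. This is a dependency-free illustration under stated assumptions rather than the authors' pipeline: the helper names, the use of `None` to mark a missing value, and the toy records are all hypothetical.

```python
def missing_fraction(records, column):
    """Fraction of records whose value for `column` is missing (None or absent)."""
    missing = sum(1 for row in records if row.get(column) is None)
    return missing / len(records)


def drop_sparse_indicators(records, max_missing=0.25):
    """Return the sorted indicator names whose missing fraction is at most max_missing."""
    columns = sorted({name for row in records for name in row})
    return [c for c in columns if missing_fraction(records, c) <= max_missing]


# Toy data (hypothetical values): ferritin is missing in 3 of 4 records,
# Hb in exactly 1 of 4, which is not "more than 25%" and is therefore kept.
toy_records = [
    {"age": 52, "Hb": 131.0, "ferritin": None},
    {"age": 47, "Hb": 128.0, "ferritin": None},
    {"age": 63, "Hb": None, "ferritin": 88.0},
    {"age": 58, "Hb": 140.0, "ferritin": None},
]
kept = drop_sparse_indicators(toy_records)
print(kept)
```

The text does not state how the remaining missing values in the retained indicators were handled, so no imputation step is shown here.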
Statistical analysis
Throughout the model development process, we used Python 3.11 as the programming environment, incorporating libraries such as Scikit-learn 1.4.2 for ML, SHAP 0.45.1 for model interpretability, Matplotlib 3.8.2 for visualization, Pandas 2.2.2 for data handling, and NumPy 1.26.3 to ensure efficient and accurate completion of ML tasks.
We performed a descriptive statistical analysis of all data, including the distribution of demographic characteristics and routine laboratory parameters in the internal validation cohort and in the model establishment and test set cohorts. For age, tumor markers, and routine blood parameters, we calculated means, standard deviations (SD), medians, and interquartile ranges in the IBC and non-IBC groups, together with p-values for the between-group differences. Continuous variables were compared using analysis of variance (ANOVA), with p-values adjusted using the false discovery rate (FDR) method. An adjusted p-value below 0.05 was considered to indicate a statistically significant difference between the IBC and non-IBC groups.
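The FDR adjustment can be illustrated with a short sketch. The text does not name the specific FDR procedure, so the Benjamini-Hochberg step-up method is assumed here; the function name and the raw p-values are hypothetical and are not results from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four indicators.
raw = {"age": 0.0004, "Hb": 0.0100, "MCV": 0.0300, "WBC": 0.6000}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
significant = [name for name, p in adjusted.items() if p < 0.05]
print(adjusted)
print(significant)
```

Each raw p-value is scaled by the number of tests divided by its rank, so indicators with borderline raw p-values are less likely to be declared significant when many indicators are tested at once.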
Performance of different models
To develop an IBC prediction model, we employed ten different ML algorithms: support vector machine (SVM), multilayer perceptron (MLP), logistic regression (LR), K-nearest neighbors (KNN), decision tree (DT), random forest (RF), extreme gradient boosting (XGBoost), gradient boosting decision tree (GBDT), light gradient boosting machine (LightGBM), and adaptive boosting (AdaBoost). To enhance model performance, we used a random search method for hyperparameter tuning, with the area under the curve (AUC) as the primary evaluation metric.
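Random search can be sketched in a few lines. The example below is a generic illustration and not the authors' tuning code: the search ranges are assumptions, and `toy_score` is a hypothetical stand-in for the mean cross-validated AUC that a real run would compute by training a model for each sampled configuration.

```python
import math
import random


def random_search(space, score_fn, n_iter=50, seed=0):
    """Sample n_iter configurations from `space` and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: sampler(rng) for name, sampler in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Assumed search space; each entry draws one value from its distribution.
space = {
    "max_depth": lambda rng: rng.randint(2, 8),
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform
    "n_estimators": lambda rng: rng.choice([100, 200, 400]),
}


def toy_score(params):
    """Hypothetical stand-in for cross-validated AUC; peaks at depth 4, rate 0.1."""
    depth_penalty = 0.01 * abs(params["max_depth"] - 4)
    rate_penalty = 0.05 * abs(math.log10(params["learning_rate"]) + 1)
    return 1.0 - depth_penalty - rate_penalty


best_params, best_score = random_search(space, toy_score)
print(best_params, round(best_score, 4))
```

In scikit-learn, `RandomizedSearchCV` with `scoring="roc_auc"` provides the same idea with cross-validation built in; the text does not state whether that class was used.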
After optimization, we evaluated the models using stratified 10-fold cross-validation to estimate their generalization ability on new data. Compared to standard k-fold cross-validation, stratified 10-fold cross-validation is particularly well-suited for imbalanced datasets because each fold maintains the same class proportions as the overall dataset, which improves the consistency and stability of the evaluation. Additionally, we used the bootstrap method to calculate confidence intervals for the evaluation metrics.
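Both ideas in this paragraph can be sketched without external libraries. The code below is an illustration under assumptions, not the authors' implementation: the text does not say which bootstrap variant was used, so a percentile bootstrap is assumed, and the per-sample correctness vector is hypothetical. The class sizes (305 IBC, 384 non-IBC) are those reported for the model establishment and test set cohorts.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with near-identical class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Deal each class round-robin so every fold gets its share.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


def bootstrap_ci(values, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(values)."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = estimates[int(n_boot * alpha / 2)]
    upper = estimates[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lower, upper


def mean(xs):
    return sum(xs) / len(xs)


labels = [1] * 305 + [0] * 384
folds = stratified_folds(labels, k=10)
overall_rate = mean(labels)
fold_rates = [mean([labels[i] for i in fold]) for fold in folds]

# Hypothetical per-sample correctness (1 = correct) to get a CI on accuracy.
correct = [1] * 90 + [0] * 10
ci_low, ci_high = bootstrap_ci(correct, mean)
print(round(overall_rate, 3), [round(r, 3) for r in fold_rates])
print(ci_low, ci_high)
```

scikit-learn's `StratifiedKFold` implements the same fold construction; the manual version is shown only to make the class-proportion guarantee explicit.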
Finally, we applied the optimized parameters to train and test different data groups, constructing auxiliary diagnostic models, which were then evaluated based on their practical performance in IBC diagnosis. The model development process is illustrated in Figure S1.
Model validation
During the model validation phase, we used samples from the test set cohort and the internal validation cohort to further evaluate each model on the key performance indicators described below.
After optimizing and training the models, we calculated several key performance indicators: accuracy, positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, recall, F1 score, and AUC. Accuracy is the proportion of samples for which the model’s prediction matches the true label. PPV and NPV reflect the reliability of the model’s positive and negative predictions, respectively. Sensitivity and specificity measure the model’s ability to correctly identify positive and negative samples, respectively. Recall is identical to sensitivity. The F1 score, the harmonic mean of precision (PPV) and recall, provides a balanced evaluation of both metrics. Finally, the AUC is a comprehensive metric that summarizes the model’s performance across all classification thresholds. An AUC value closer to 1 indicates a stronger ability to distinguish between positive and negative samples.
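The definitions above can be made concrete with a small worked example. This is a minimal sketch on hypothetical labels and scores, not data from this study; the 0.5 decision threshold is an assumption, and the helpers omit zero-division guards for brevity.

```python
def classification_metrics(y_true, y_pred):
    """Compute threshold-based metrics from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)  # identical to recall
    ppv = tp / (tp + fp)          # identical to precision
    return {
        "accuracy": (tp + tn) / len(pairs),
        "ppv": ppv,
        "npv": tn / (tn + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }


def auc_score(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = IBC) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
print(metrics)
print(round(auc, 4))
```

Unlike the other metrics, the AUC does not depend on the chosen threshold, which is why it is suitable as the primary metric for comparing models.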
By comparing and analyzing these evaluation metrics, we can gain a deeper understanding of the strengths and weaknesses of each model, allowing us to select the most appropriate model for this IBC dataset (Figure S2).
System development
To improve clinicians’ accuracy in the early screening of IBC, we developed a web-based tool that integrates routine blood parameters, tumor marker indicators, age, and the optimal ML model. The system’s homepage displays daily statistics on hospital visits and patient data (Figure S3). Doctors can log into the system to view all test data for each patient (Figure S4). By inputting a patient’s personal information, doctors can generate an IBC diagnosis report to use as a reference (Figures S5-S6). Clinicians can then decide, based on their clinical experience, whether to adopt the system’s predicted diagnosis and confirm their decision within the system.
Results
Patient characteristics and variables
The internal validation cohort included a total of 486 patients, comprising 131 IBC patients and 355 non-IBC patients. The model establishment and test set cohorts included a total of 689 patients, comprising 305 IBC patients and 384 non-IBC patients. In the internal validation cohort, several variables demonstrated highly significant differences between the IBC and non-IBC groups, with p-values less than 0.001. These variables included age, Baso#, Baso%, Hb, Hct, Lymph#, Lymph%, MPV, Mono%, P-LCR, and PCT. Additionally, CA15-3, CEA, Neut%, and PLT had p-values below 0.05, indicating statistically significant differences. In contrast, Eos#, Eos%, MCH, MCHC, MCV, Neut#, RDW-CV, and WBC did not show statistically significant differences (Table S1). In the model establishment and test set cohorts, age, CA15-3, CEA, Baso#, Baso%, Hb, Hct, Lymph#, Lymph%, MPV, Mono#, and Mono% all exhibited p-values below 0.001, highlighting their strong discriminatory power between IBC and non-IBC cases. MCV and Neut# also showed significant differences, with p-values less than 0.05. However, Eos#, Eos%, MCH, MCHC, and WBC did not exhibit statistically significant differences (Table S2).
Evaluation of the predictive performance of different models for IBC group and non-IBC group in the test set cohort
As shown in Table S3, validation using the test set cohort revealed that all ten ML algorithms achieved AUC values greater than 0.920, indicating high classification performance. The XGBoost model achieved the highest AUC, 0.975 (95% CI: 0.945-1.000) (Figure 1(a)). Among the other six evaluation metrics, the XGBoost model also achieved the highest values for accuracy and F1 score, at 0.935 (95% CI: 0.892-0.970) and 0.934 (95% CI: 0.891-0.964), respectively. Its NPV and sensitivity were also high, at 0.967 (95% CI: 0.945-0.990) and 0.961 (95% CI: 0.928-0.987), respectively (Figure 2(a)). Overall, the XGBoost model achieved values above 0.910 across all evaluation metrics, demonstrating superior performance compared to the other models. Therefore, the XGBoost model was selected as the optimal choice for IBC prediction.
Figure 1. ROC curves of 10 ML models. (a) The AUC value results in the test set cohort. (b) The AUC value results in the internal validation cohort. The horizontal axis is the False Positive Rate (FPR), the vertical axis is the True Positive Rate (TPR), and the area under each curve is the AUC value of the model, which is used to measure the overall performance of the model.
Figure 2. Results of six evaluation metrics across 10 ML models. The figure shows the models’ performance on the test set (a) and internal validation cohort (b). (a) Line chart. Different colors represent different models; the horizontal axis represents the evaluation metric, and the vertical axis represents the value of the evaluation metric. (b) Bar chart. Different colors represent different evaluation metrics; the horizontal axis represents the models, and the vertical axis represents the value of the evaluation metric.

Evaluation of the predictive performance of different models for IBC group and non-IBC group in internal validation cohort
As shown in Table S4, we further evaluated the performance of our models using the internal validation cohort. Comparing seven evaluation metrics across the ten ML models, we found that the XGBoost model performed well on all of them. The ROC curves of the ten models are plotted in Figure 1(b). All ten models achieved an AUC value exceeding 0.920, indicating excellent classification performance. Notably, the XGBoost model had the highest AUC value, reaching 0.982 (95% CI: 0.970-0.992). This model also outperformed the other models in four key metrics: accuracy, NPV, sensitivity, and F1 score, with values of 0.947 (95% CI: 0.922-0.967), 0.960 (95% CI: 0.938-0.977), 0.885 (95% CI: 0.823-0.938), and 0.898 (95% CI: 0.855-0.934), respectively (Figure 2(b)).
Analysis of model interpretability
Shapley additive explanations (SHAP) is a widely used method for interpreting ML models and assessing the contribution of each feature to model predictions. According to Figure 3, the top 10 features of the XGBoost model, ranked by importance, are Baso%, Baso#, PDW, age, Mono%, CA15-3, PCT, Lymph#, P-LCR, and MPV. The influence of these features remains generally consistent across the various models. Notably, the XGBoost model is mainly driven by the basophil indices from routine blood tests, which are crucial for the early detection of IBC.
Figure 3. SHAP value plot for XGBoost, the top-performing ML model. The SHAP values of the top ten routine clinical examination indicators for predicting IBC versus non-IBC are shown, reflecting their importance to the model prediction.
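To show what a Shapley value measures, the sketch below computes exact values by enumerating every feature coalition for a tiny model. It is an illustration only: it is not the SHAP library's tree algorithm, `toy_model` is a hypothetical scoring function rather than the fitted XGBoost model, and the instance and baseline values are invented.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating every feature coalition."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Coalition features take the instance's value; the rest use the baseline.
        mixed = {f: (instance[f] if f in coalition else baseline[f]) for f in names}
        return model(mixed)

    phi = {}
    for feature in names:
        others = [f for f in names if f != feature]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = set(subset)
                total += weight * (value(without | {feature}) - value(without))
        phi[feature] = total
    return phi


def toy_model(x):
    """Hypothetical risk score with one interaction term."""
    return 2.0 * x["Baso%"] + 1.0 * x["PDW"] + 0.5 * x["Baso%"] * x["age"]


instance = {"Baso%": 1.0, "PDW": 2.0, "age": 4.0}
baseline = {"Baso%": 0.0, "PDW": 0.0, "age": 0.0}
phi = shapley_values(toy_model, instance, baseline)
ranking = sorted(phi, key=lambda f: abs(phi[f]), reverse=True)
print(phi)
print(ranking)
```

The contributions sum to the difference between the model output for the instance and for the baseline, and the interaction term is shared equally between the two features involved. Brute-force enumeration grows exponentially with the number of features, which is why dedicated tree-based algorithms are used in practice for models with 26 inputs.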
Discussion
Principal results
IBC remains a significant threat to the physical and mental health of women, particularly in China, where its incidence continues to rise.18–20 Early diagnosis of IBC is crucial for improving patient prognosis and alleviating the burden of the disease. This study successfully developed and validated a new method for predicting IBC using an ML model based on 26 routine clinical examination indicators, while also exploring its clinical application value.
The core finding of this study is that the XGBoost model exhibits extremely high discriminative performance (AUC > 0.970) in both the independent test set and internal validation cohort, confirming its ability to reliably distinguish between IBC and non-IBC patients. This performance not only significantly outperforms traditional clinical diagnostic methods but also surpasses nine other comparative ML models, highlighting its modeling advantages in complex medical data. XGBoost’s superior performance is mainly due to two factors. First, its gradient boosting tree algorithm automatically identifies key feature combinations, capturing nonlinear relationships that traditional statistical models struggle to represent. Second, regularization strategies and cross-validation effectively reduce the risk of overfitting, supporting the model’s stability on unseen data. Notably, the model’s high sensitivity and NPV suggest that it misses few IBC cases and that its negative predictions are reliable, thereby reducing unnecessary follow-up examinations and alleviating patient anxiety.21–23 Additionally, feature importance analysis using SHAP values revealed that routine clinical examination indicators significantly contributed to the model’s predictions.24–26 Among these, Baso% emerged as the most influential feature. Inflammatory responses in the body may be directly reflected in abnormal inflammatory indicators in routine blood tests, which underscores the potential clinical application of Baso%. Our research also identified associations of Baso%, Baso#, PDW, P-LCR, and MPV with IBC diagnosis and prognosis, aligning with findings from existing studies.27–30 Notably, PDW, an important marker of platelet activation, is significantly associated with poor prognosis of IBC, and a decreased PDW level is also associated with histological subtype, multifocal lesions, and lymph node metastasis status.
Multivariate analysis further confirmed that PDW is an independent predictor of bone metastasis.31 In addition, the monocyte-lymphocyte ratio (MLR), an inflammatory marker detectable in peripheral blood, reflects the dynamic balance between pro-tumor monocytes and anti-tumor lymphocytes and has shown predictive value for IBC treatment response in multiple studies.32,33 In terms of imaging assessment, ultrasound examination demonstrates high sensitivity for detecting axillary lymph node metastasis but low specificity.34 Combining platelet parameters such as MPV with immune-related indicators such as lymphocyte measures can further improve the assessment of IBC patients.35 The medical literature36,37 also emphasizes the close correlation between age and the malignant progression and prognosis of IBC. Incorporating the typical tumor marker CA15-338 into clinical decision-making can lead to more effective diagnosis and treatment of IBC. These findings emphasize the importance of incorporating routinely collected clinical data into predictive models, as such data can offer valuable insights into patient health without requiring costly and invasive tests.
Meanwhile, similar studies39 report the use of CDP nanobiosensors, immunohistochemical data, and ML algorithms to diagnose IBC. However, these studies are limited in data scale and feature count and lack model interpretation, so they do not account for the contribution of clinical indicators to IBC prediction. By contrast, a strength of this study lies in the accessibility and cost-effectiveness of routine laboratory test data. Compared to current IBC screening technologies, this method not only significantly shortens appointment wait times and reduces diagnostic delays, but is also not constrained by the frequency of screenings or the age at which screening begins.40
Limitations and future work
Despite these advantages, several limitations must be acknowledged. The retrospective nature of data collection may introduce bias, and while the sample size is substantial, further increasing it would enhance the validation of our findings.41,42 Additionally, although the model has demonstrated excellent performance in both the test sets and internal cohorts, its applicability to a broader population warrants further investigation. External validation in independent, multi-institutional cohorts is also needed to assess the generalizability and clinical potential of the findings.43,44 Future research should focus on including diverse patient populations and considering the integration of conventional clinical indicators with novel biosensor data to enhance model interpretability and construct more comprehensive and reliable IBC prediction systems.
Conclusion
In summary, this study provides a promising alternative method for the early screening of IBC. An ML model was developed using 26 routine clinical examination indicators, which was then used to build a web-based tool for clinical application. With ongoing technological advancements, we believe these methods hold the promise of achieving even greater breakthroughs in the early screening of IBC.
Supplemental material
Supplemental material for Development and validation of a machine learning model for predicting invasive breast cancer using 26 routine clinical examination indicators by Lijuan Pan, Wenjing Deng, Ziwei Zhao, Yulong Liu, Xuelian Peng, Chunyan Yang, Baoru Han, Shan Shi and Jin Li in Digital Health.
Footnotes
Acknowledgements
This clinical research project has been approved by the Affiliated Dazu’s Hospital of Chongqing Medical University. We extend our sincere gratitude to all participants involved in this study.
Ethical considerations
This study was carried out according to the protocol which was reviewed and approved by the Medical Ethics Committee of The Affiliated Dazu’s Hospital of Chongqing Medical University (Approval No. DZ2024-04-039). The Ethics Committee approved this study protocol and waived the obligation for informed consent because of the retrospective nature of the study.
Author contributions
L.P., W.D., and Y.L. collected the case and experimental data and drafted the main manuscript. W.D. analysed the experimental data, L.P. performed data analysis and interpretation, and B.H. and J.L. provided major funding for the study. B.H., S.S., and J.L. provided major revisions to the manuscript. All authors have read and approved the final manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was financially supported by grants from Chongqing Natural Science Foundation General Project (No. CSTB2024NSCO-MSX0439), Chongqing Medical Scientific Research Project (Joint Project of Chongqing Health Commission and Science and Technology Bureau) (No. 2024MSXM045), the Major Joint Science and Health Project of DaZu District (No. DZKJ2024JSYJ-KWXM1001), and the Intelligent Medical Project of Chongqing Medical University (No. ZHYX202206).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Supplemental material
Supplemental material for this article is available online.
Appendix
References
