Sage Journals: Discover world-class research

Abstract

Objective

To develop and validate a robust machine learning-based prediction model for assessing the risk of thrombotic events in critically ill cancer patients during their ICU stay.

Methods

This retrospective observational study utilized data from 1892 cancer patients in the MIMIC-IV database for model development and internal validation. A stringent data preprocessing pipeline was applied, including multiple imputation for missing data, exclusion of outliers, and the use of the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance. Feature importance was evaluated using SHAP, leading to the selection of six key predictors. Nine machine learning models were constructed and compared. Model performance was assessed using the Area Under the Curve (AUC), F1-score, recall, Matthews correlation coefficient (MCC), accuracy, and specificity. The optimal model was selected, calibrated, and interpreted using SHAP. Its clinical utility was further evaluated via calibration curves and decision curve analysis (DCA). Finally, external validation was performed on an independent dataset of 200 patients from our institution.

Results

The CatBoost model demonstrated superior performance. In internal validation, the calibrated model achieved an AUC of 0.855 (95% CI: 0.797-0.913), with a sensitivity of 0.971 and a specificity of 0.753 at an optimal threshold of 0.245. In external validation, the model maintained strong performance with an AUC of 0.83 (95% CI: 0.742-0.918), sensitivity of 0.968, and specificity of 0.698. SHAP analysis identified “history of thrombosis” as the most influential predictor. Decision curve analysis confirmed the model's clinical utility across a wide risk threshold range (0.25-0.75). The final model was deployed as an online platform to facilitate real-time, individualized risk assessment.

Conclusion

The developed CatBoost model exhibits excellent discriminatory power, good calibration, and favorable clinical interpretability for predicting thrombosis risk in critically ill cancer patients. It serves as a promising and reliable clinical decision support tool to guide personalized thromboprophylaxis and improve patient outcomes.

Keywords

machine learning, thrombosis, critically ill patients with cancer, predictive modeling, MIMIC-IV database, explainable artificial intelligence (XAI)

Introduction

Cancer patients, particularly those admitted to the intensive care unit (ICU) in critical condition, are at exceptionally high risk for venous thromboembolism (VTE).^1,2 Within the ICU environment, this elevated risk is further amplified by cancer-related hypercoagulability, frequent invasive procedures, prolonged immobilization, and multiple comorbidities.^3,4 Accurate and timely assessment of thrombotic risk in this vulnerable population is therefore essential to guide appropriate prophylactic anticoagulation and improve clinical outcomes.⁵

Current clinical tools for VTE risk stratification in cancer patients, such as the Caprini, Padua, and Khorana scores, exhibit notable limitations.⁶ The Caprini score is primarily validated in surgical populations and lacks generalizability to non-surgical critically ill patients; the Padua score demonstrates suboptimal predictive performance in certain ethnic groups, including Chinese populations, likely due to genetic and epidemiological differences; and the Khorana score, although tailored for ambulatory chemotherapy recipients, relies on a static set of variables derived from limited observational data, making it insufficient to reflect the rapidly evolving physiological dynamics of ICU patients.^7,8 These conventional models depend on fixed-variable frameworks and fail to leverage the rich, multidimensional, and real-time data routinely captured in ICU settings, resulting in suboptimal sensitivity and limited utility in the era of precision medicine.

In contrast, machine learning (ML) approaches offer the capacity to analyze high-dimensional datasets and model complex, nonlinear interactions among variables, thereby enabling more accurate and adaptive risk prediction in heterogeneous clinical environments.^9,10 Advances in electronic health record systems and access to large-scale, open-source critical care databases—such as MIMIC-IV—have created new opportunities to develop dynamic, individualized VTE risk prediction models using ML techniques.^11,12

This study therefore aimed to develop and validate an interpretable machine learning model to predict thrombosis risk specifically in ICU-admitted patients with malignant tumors.^13,14 We utilized the Medical Information Mart for Intensive Care (MIMIC-IV) database to train and test multiple algorithms.¹⁵ The Shapley Additive exPlanations (SHAP) method was employed to identify and interpret the key predictors from a comprehensive set of clinical variables, vital signs, and severity scores. Model performance was rigorously evaluated and compared using metrics including the area under the receiver operating characteristic curve (AUC) and the F1-score. Our ultimate goal was to establish an optimal, interpretable predictive tool to aid clinicians in proactive decision-making and potentially improve patient outcomes in this high-stakes clinical environment.

Methods

Study Design and Population

This retrospective, observational study was conducted utilizing the Medical Information Mart for Intensive Care IV (MIMIC-IV) (v2.2) database for predictive model development and internal validation.¹⁵ Developed by the Laboratory for Computational Physiology at the Massachusetts Institute of Technology, MIMIC-IV is an open-access resource that contains comprehensive electronic health records of ICU patients from the Beth Israel Deaconess Medical Center in the United States. The researcher, Yang Chang, underwent the necessary data access training, secured the required permissions, and extracted the relevant data.

Patients with their first ICU admission having a confirmed diagnosis of a malignant tumor (CA), as indicated by the MIMIC-IV diagnostic codes, were included in this study. The following patients were excluded from this study: (1) patients aged < 18 years at admission, (2) patients with non-first ICU admissions, (3) patients with ICU stay < 1 day, (4) patients with key variable missing rates > 50%, (5) patients with incomplete critical information on the first day of admission (eg, D-dimer, platelets, hemoglobin, white blood cell count), (6) patients with pregnancy-related or postpartum thrombosis. Ultimately, 1892 patients were included in this study.

An independent external validation was conducted using a dataset of 200 ICU patients with malignancies from Chongqing University Cancer Hospital (from December 2024 to June 2025). Inclusion criteria were a diagnosis containing “malignancy” or “cancer” and admission to the ICU for the first time, with all patients undergoing Doppler ultrasound to confirm thrombosis diagnosis. Exclusion criteria were consistent with those for the MIMIC-IV study population. This external validation set was used to assess the model's generalizability.^13,16

The study was conducted in accordance with the ethical principles outlined in the Declaration of Helsinki. Use of the MIMIC-IV database was approved by the Massachusetts Institute of Technology Institutional Review Board (IRB). Use of the institutional database was approved by the Ethics Committee of Chongqing University Cancer Hospital (Approval Number: CZLS2023375-A). As all patient identifiers were removed prior to analysis, the requirement for informed consent was waived.

The primary objective of this study was to predict the risk of thrombosis in ICU patients with malignant tumors. The literature indicates that the incidence of VTE varies across patients with cancer. A study reported a VTE incidence of 3.7% in ICU patients, whereas another study noted that patients with high-risk cancer could experience VTE rates > 10%.^1,8 Considering the characteristics of the study population and potential high-risk factors, a 10% event rate was adopted for the sample size estimation.^2,17

To ensure the robustness of the predictive model, the sample size was estimated based on the “events per variable (EPV)” principle, using the following formula:

n = \frac{\mathrm{EPV} \times K}{P}

where: n represents the sample size; EPV represents the number of events required per variable, set to 20 in this study to ensure model robustness; K denotes the number of candidate predictors, which is 8 in this case; P signifies the event rate set to 0.10 for this study.

Substituting the values, we obtain the following:

n = 20 \times 8 / 0.10 = 1600.

The calculation yielded a minimum required sample size of 1600. Our final dataset of 1892 patients satisfied the criterion of EPV ≥ 20, which aligns with recommendations for predictive model development.¹⁸
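The arithmetic above can be checked with a short Python sketch. The function name `epv_sample_size` is hypothetical and introduced only for illustration; the inputs (EPV = 20, K = 8, P = 0.10) and the 274 observed events are taken from the text.

```python
import math


def epv_sample_size(epv: int, n_predictors: int, event_rate: float) -> int:
    """Minimum sample size n = EPV * K / P, rounded up to a whole patient."""
    if not 0 < event_rate <= 1:
        raise ValueError("event_rate must lie in (0, 1]")
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(epv * n_predictors / event_rate, 9))


# Values stated in the text: EPV = 20, K = 8 candidate predictors, P = 0.10
n_min = epv_sample_size(epv=20, n_predictors=8, event_rate=0.10)

# Events per candidate predictor actually observed in the full cohort
# (274 thrombotic events, 8 candidate predictors)
observed_epv = 274 / 8

print(n_min, observed_epv)
```

With these inputs the required sample size is 1600, and the 274 observed events correspond to roughly 34 events per candidate predictor.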

Feature Extraction and Outcome Definition

Clinical Characteristics and Data Processing

This study utilized a no-code tool (DecisionLinnc software) for data extraction and preprocessing. As a no-code tool was employed, specific preprocessing scripts cannot be provided. To ensure the reproducibility and transparency of the research results, this paper will offer the SQL query scripts used to extract data from the MIMIC-IV database in the Appendix. These query scripts will assist other researchers in replicating the data extraction process. Additionally, to further enhance reproducibility, we also provided an external validation set from our hospital for model validation. The extracted potential predictors were categorized into six groups:

Demographic details, such as race, age, sex, weight, height, and marital status

Comorbid conditions, including sepsis (sepsis3,¹⁹ respiratory, coagulation, liver, cardiovascular, central nervous system, renal), acute kidney injury stage, and history of thrombosis

Laboratory measurements, such as platelet count, white blood cell count, hemoglobin and D-dimer levels, and neutrophil and lymphocyte counts

Admission severity scores, including the SOFA,²⁰ Glasgow Coma Scale (GCS),²¹ Systemic Inflammatory Response Syndrome (SIRS),²² and APACHE II scores,²³ Simplified Pulmonary Embolism Severity Index (SPESI) score,²⁴ Shock index²⁵

Mechanical ventilation

Fluid balance metrics, with all laboratory variables and severity scores obtained from initial tests or assessments conducted upon ICU admission

The overall dataset can be found in Appendix 1, and the specific SQL extraction queries are detailed in Appendix 2.

Outcome Measurement

The primary outcome measure was thrombosis occurrence. The corresponding thrombosis complication codes extracted from the MIMIC-IV database are presented. For the external validation set, the outcome was thrombosis diagnosed by Doppler ultrasound.³

Data Preprocessing and Feature Selection

Data Preprocessing

To minimize bias and enhance model accuracy, all indicators were processed using real values, whenever possible. The DecisionLinnc software was used to preprocess the data, screen, and exclude samples with missing values from the scoring sheets. For continuous variables, multiple imputation techniques were applied to address missing data (platelet count, 0.3%; white blood cell count, 0.1%).¹⁷ Subsequently, the dataset was divided into training and test sets at a 7:3 ratio for patients with thrombosis. Outliers in the training set were identified using the boxplot method based on quartiles and interquartile ranges (IQRs), with data points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR classified as outliers. Continuous variables (eg, hemoglobin level, platelet count, white blood cell count, neutrophil count, ventilation time, fluid balance, input/output fluid volume) were filtered. Owing to the low incidence of thrombosis, class imbalance was addressed by applying the Synthetic Minority Over-sampling Technique (SMOTE) method to the training set and adjusting the ratio of patients without thrombosis to patients with thrombosis from 1125:199 to 1125:1125. The test set remained unprocessed to prevent information leakage.^26,27
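The boxplot rule described above can be sketched in a few lines of Python. This is an illustration only: the study used the no-code DecisionLinnc software, whose quartile definition is not stated, so the `inclusive` quantile method is an assumption, and the platelet values are hypothetical.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) boxplot fences Q1 - k*IQR and Q3 + k*IQR."""
    # Quartile definition is an assumption; statistical packages differ here
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values, k=1.5):
    """Keep only the values that lie within the boxplot fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if low <= v <= high]


# Hypothetical platelet counts; 900 is an implausibly extreme value
platelets = [120, 150, 165, 179, 182, 195, 210, 240, 900]
fences = iqr_fences(platelets)
kept = drop_outliers(platelets)
print(fences, kept)
```

Applying the rule to the training set only, as the text describes, keeps the test set untouched by any threshold derived from the data.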

Predictor Selection

Following the sampling of the training set, a decision tree model based on Gini coefficient partitioning was used, and the SHapley Additive exPlanations (SHAP) method was used to rank predictor importance.²⁸ The model parameters were configured as follows: maximum tree depth, 5; minimum internal node sample size, 2; minimum leaf node sample size, 1; and minimum impurity reduction threshold, 0. Using the SHAP method, the influence of each variable on the model predictions was analyzed, yielding ranking results (Figure 2). This screening approach, based on model contribution, automatically filtered out variables with limited clinical predictive value or redundancy (such as some demographic variables), thereby ensuring the final model was both concise and efficient. The top six predictors accounting for > 60% of the cumulative feature importance were identified as prior thrombosis history, GCS score, shock index, SIRS score, Simplified Pulmonary Embolism Severity Index (SPESI) score, and total mechanical ventilation duration. These predictors were selected based on their dominant roles in shaping the predictive logic of the model. Furthermore, although the initial feature set included some physiological scores with potentially overlapping information (eg, SIRS, SOFA, APACHE II), the tree-based ensemble models used are inherently insensitive to multicollinearity, which further ensures the robustness of both the feature importance ranking and the final model performance.
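The cumulative-importance cutoff can be illustrated with a minimal Python sketch. The function `select_top_features`, the feature names, and the importance values are all hypothetical; they do not reproduce the study's SHAP values, only the selection logic of ranking features and keeping them until the cumulative share exceeds 60%.

```python
def select_top_features(importance, threshold=0.60):
    """Rank features by importance and keep them until the cumulative
    share of total importance first exceeds the threshold."""
    total = sum(importance.values())
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, value in ranked:
        selected.append(name)
        cumulative += value / total
        if cumulative > threshold:
            break
    return selected, cumulative


# Hypothetical mean |SHAP| values, not taken from the study
mean_abs_shap = {
    "feat_a": 0.30, "feat_b": 0.12, "feat_c": 0.10, "feat_d": 0.09,
    "feat_e": 0.08, "feat_f": 0.07, "feat_g": 0.06, "feat_h": 0.05,
    "feat_i": 0.05, "feat_j": 0.04, "feat_k": 0.04,
}
selected, cumulative = select_top_features(mean_abs_shap)
print(selected, round(cumulative, 2))
```

In this toy input four features cross the 60% mark; in the study the same rule yielded six predictors.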

Figure 1.

Technical roadmap.

Model Development and Evaluation

Model Development and Validation

Nine machine learning models were implemented in this study: Light Gradient-Boosting Machine (LGBM),²⁹ decision tree, gradient boosting tree, Extreme Gradient Boosting (XGBoost),³⁰ random forest (RF),³¹ naive Bayes, Adaptive Boosting (AdaBoost), Categorical Boosting (CatBoost), and logistic regression classifier. The top three models, based on the AUC, were further evaluated using the bootstrap method to determine the optimal predictive model.³²

Model Performance Comparison

Multiple metrics, such as the F1-score, Matthews correlation coefficient (MCC), recall, specificity, and AUC, were used to comprehensively assess the performance and reliability of the predictive models. The F1-score reflects the harmonic mean of precision and recall, balancing the model's ability to identify positive samples and its prediction reliability. MCC provides a balanced evaluation of classification models in imbalanced datasets, whereas recall emphasizes the effective identification of positive cases to reduce missed diagnoses.³³ Specificity controls false-positive rates, minimizing misdiagnosis risks, and AUC evaluates the overall model robustness across various thresholds. Combining these metrics facilitates a thorough assessment of model accuracy and clinical applicability. Bootstrap resampling was used during model selection and performance comparison to enhance stability and reliability.³² Confidence intervals (CIs) for key indicators, such as AUC, sensitivity, and specificity, were estimated using the bootstrap method, further strengthening the robustness and interpretability of the results.¹⁸
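As a worked illustration of these definitions, the sketch below computes the metrics from confusion-matrix counts. The counts (TP = 59, FP = 84, TN = 409, FN = 16) are not reported in the study; they are inferred here from the CatBoost recall and specificity in Table 3 together with the test-set class sizes (75 events, 493 non-events), and should be read as an assumption.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Recall, specificity, precision, F1, MCC, and accuracy from counts."""
    recall = tp / (tp + fn)            # sensitivity: share of events detected
    specificity = tn / (tn + fp)       # share of non-events correctly cleared
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "recall": recall, "specificity": specificity, "precision": precision,
        "f1": f1, "mcc": mcc, "accuracy": accuracy,
    }


# Inferred counts (assumption), consistent with 75 events and 493 non-events
metrics = classification_metrics(tp=59, fp=84, tn=409, fn=16)
print({name: round(value, 3) for name, value in metrics.items()})
```

Under these inferred counts the F1-score and MCC round to 0.541 and 0.481, matching the CatBoost row of Table 3. With an event rate near 13%, precision is modest even at high recall, which is why the F1-score sits well below the AUROC.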

Model Interpretation and Deployment

Model Interpretation

To improve model transparency, the SHAP method was introduced to quantitatively analyze the individual prediction contributions. Based on cooperative game theory, SHAP calculates the marginal contributions across all feature combinations, ensuring consistent and locally accurate explanations of the model outputs. In clinical prediction models, SHAP reveals global feature importance and provides patient-specific insights, aiding physicians in understanding model judgment. This approach enhances model credibility and supports traceability in clinical decision-making.^28,34
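The idea of averaging marginal contributions over all feature combinations can be made concrete with an exact Shapley computation on a toy model. Everything in this sketch is an assumption for illustration: `toy_risk` is a made-up scoring function, not the study's CatBoost model, and absent features are replaced by a fixed baseline, which is a simplification of what SHAP implementations for tree models actually do.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by enumerating every coalition of features.
    Features outside a coalition take their baseline value (simplification)."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        x = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(x)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_risk(x):
    """Hypothetical risk score with one interaction term (not the study model)."""
    return (0.05 + 0.40 * x["history_thrombosis"] + 0.01 * (15 - x["gcs"])
            + 0.10 * x["history_thrombosis"] * (x["shock_index"] > 1.0))


instance = {"history_thrombosis": 1, "gcs": 9, "shock_index": 1.3}
baseline = {"history_thrombosis": 0, "gcs": 15, "shock_index": 0.8}
phi = shapley_values(toy_risk, instance, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for the patient and the prediction for the baseline, which is the local accuracy property mentioned above. The interaction term is split equally between the two features involved.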

Web-Based Calculator

For practical clinical applications, the final optimal prediction model was integrated into a web-based platform featuring an intuitive graphical user interface. Medical staff can input patient-specific clinical variables (eg, thrombosis history, GCS score, mechanical ventilation duration) into the interface, enabling the system to automatically compute and output the predicted probability of thrombotic events. This tool enhances clinical usability and facilitates personalized risk assessment, thereby assisting physicians with targeted prevention and intervention strategies. By encapsulating complex algorithms in a user-friendly format, the transition from model development to clinical practice can be achieved effectively, thereby supporting precision medicine initiatives.³⁵

Statistical Learning

This study used the DecisionLinnc software for the statistical analyses. Initially, raw data were preprocessed to automatically eliminate all samples containing missing values in any row, ensuring data completeness. Following this, the baseline feature table was used to describe the fundamental characteristics of the data; continuous variables are expressed as medians and IQRs, whereas categorical variables are represented as frequencies and percentages. For grouped data, hypothesis testing was applied to assess differences between variables, with a P value < .05 indicating statistical significance. To further evaluate the balance of each variable's distribution across the groups, the standardized mean difference (SMD) was calculated. This metric is unaffected by the sample size and is appropriate for measuring the distributional differences between groups. Typically, an SMD < 0.1 suggests a balanced distribution of the variable across groups, allowing the difference to be disregarded (Table 1).
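For readers unfamiliar with the SMD, the following Python sketch shows two common definitions, one for continuous and one for binary variables. The exact formula used by DecisionLinnc is not stated, so these definitions are an assumption. The binary example uses the history of thrombosis counts from the external validation cohort in Table 2 (30 of 43 patients with thrombosis vs 4 of 157 without); the continuous example uses made-up numbers.

```python
import math
import statistics


def smd_continuous(group_a, group_b):
    """(mean_a - mean_b) divided by the pooled standard deviation."""
    pooled_var = (statistics.variance(group_a) + statistics.variance(group_b)) / 2
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)


def smd_binary(p_a, p_b):
    """Difference in proportions divided by the pooled binomial SD."""
    pooled_var = (p_a * (1 - p_a) + p_b * (1 - p_b)) / 2
    return (p_a - p_b) / math.sqrt(pooled_var)


# Hypothetical continuous measurements for two groups
smd_toy = smd_continuous([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])

# History of thrombosis in the external validation cohort (Table 2)
smd_history = smd_binary(30 / 43, 4 / 157)
print(round(smd_toy, 3), round(smd_history, 3))
```

Under this definition the history of thrombosis SMD evaluates to about 1.958, in line with the value listed in Table 2.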

Table 1.

The Baseline Table of the Total Dataset, as Well as the Comparison Table of the Training Set and the Test Set.

Variable	Levels	Overall	Thrombosis=0	Thrombosis=1	P-Value	SMD	Test	Train	P-Value	SMD
		N = 1892	N = 1618	N = 274			N = 568	N = 1324
Thrombosis (%)	0	1618 (85.52%)					493 (86.80%)	1125 (84.97%)	.34	0.05
Thrombosis (%)	1	274 (14.48%)					75 (13.20%)	199 (15.03%)
Gender, n (p%)					.33	0.06			.54	0.03
	F (1)	798 (42.18%)	675 (41.72%)	123 (44.89%)			233 (41.02%)	565 (42.67%)
	M (0)	1094 (57.82%)	943 (58.28%)	151 (55.11%)			335 (58.98%)	759 (57.33%)
Age, median (IQR)		75.00 (16.00)	75.00 (15.00)	71.50 (16.00)	<.001	0.25	75.00 (16.00)	75.00 (16.00)	.51	0.03
Race, n (p%)					.03	0.38			.18	0.27
	WHITE	1396 (73.79%)	1195 (73.86%)	201 (73.35%)			405 (71.30%)	991 (74.85%)
	BLACK	148 (7.46%)	122 (7.54%)	24 (9.49%)			48 (8.45%)	100 (7.79%)
	ASIAN	55 (2.91%)	45 (2.78%)	10 (3.33%)			22 (3.87%)	31 (2.35%)
	HISPANIC OR LATINO	31 (1.65%)	22 (1.35%)	9 (3.27%)			11 (1.94%)	20 (1.51%)
	OTHER	92 (4.87%)	85 (5.24%)	7 (2.54%)			32 (5.63%)	62 (4.68%)
	UNKNOWN	170 (8.99%)	149 (9.21%)	21 (7.66%)			50 (8.80%)	120 (9.06%)
Insurance, n (p%)					<.001	0.28			.03	0.18
	Medicaid	148 (7.82%)	116 (7.17%)	32 (11.68%)			51 (8.98%)	97 (7.33%)
	Medicare	1400 (74.00%)	1227 (75.83%)	173 (63.14%)			418 (73.59%)	982 (74.17%)
	Other	18 (0.95%)	14 (0.87%)	4 (1.46%)			0 (0.00%)	18 (1.36%)
	Private	316 (16.70%)	252 (15.57%)	64 (23.36%)			95 (16.73%)	221 (16.69%)
	(Missing)	10 (0.53%)	9 (0.56%)	1 (0.36%)			4 (0.70%)	6 (0.45%)
Marital status, n (p%)					.38	0.12			.47	0.08
	DIVORCED	121 (6.40%)	102 (6.30%)	19 (6.93%)			40 (7.04%)	81 (6.12%)
	MARRIED	960 (50.74%)	821 (50.74%)	139 (50.73%)			273 (48.06%)	687 (51.89%)
	SINGLE	386 (20.40%)	322 (19.90%)	64 (23.36%)			117 (20.60%)	269 (20.32%)
	WIDOWED	302 (15.96%)	266 (16.44%)	36 (13.14%)			98 (17.25%)	204 (15.41%)
	(Missing)	123 (6.50%)	107 (6.62%)	16 (5.84%)			40 (7.05%)	83 (6.26%)
Hemoglobin, median (IQR)		9.90 (2.90)	9.90 (2.90)	10.10 (2.70)	.50	−0.04	9.90 (2.70)	9.90 (3.00)	.85	−0.01
Platelet count, median (IQR)		182.00 (151.00)	179.00 (151.00)	195.00 (147.00)	.68	−0.03	187.00 (149.00)	179.00 (152.00)	.99	0.00
White blood cells, median (IQR)		11.80 (9.40)	11.80 (9.00)	11.90 (11.70)	.24	−0.08	12.15 (9.20)	11.70 (9.30)	.34	0.05
Neutrophils, median (IQR)		81.85 (13.80)	81.45 (13.50)	82.85 (14.30)	.08	−0.11	82.00 (13.00)	81.75 (14.00)	.58	0.03
Lymphocytes, median (IQR)		8.60 (9.80)	9.00 (9.90)	7.05 (8.20)	.02	0.15	8.00 (9.05)	9.00 (10.00)	.55	−0.03
Ventilation hour, median (IQR)		57.00 (84.68)	55.07 (82.47)	68.23 (91.50)	.03	−0.16	64.08 (96.63)	55.35 (79.00)	.08	0.09
Input amount, median (IQR)		4224.66 (4376.31)	4262.16 (4335.24)	3804.73 (4503.50)	.53	−0.04	4560.00 (4615.30)	4166.35 (4199.16)	.03	0.11
Output amount, median (IQR)		1720.00 (1686.00)	1763.00 (1658.00)	1536.00 (1816.00)	.22	0.05	1657.50 (1591.00)	1751.00 (1715.50)	.69	−0.02
Balance, median (IQR)		2288.69 (4037.01)	2266.42 (4012.45)	2468.91 (4304.54)	.16	−0.07	2527.08 (4360.76)	2217.90 (3920.66)	.12	0.08
SOFA, median (IQR)		6.00 (5.00)	6.00 (5.00)	6.00 (5.00)	.24	−0.08	6.00 (5.00)	6.00 (4.00)	.10	0.08
sapsii, median (IQR)		44.00 (19.00)	44.00 (18.00)	46.00 (19.00)	.03	−0.15	46.00 (18.00)	43.00 (18.00)	.00	0.15
oasis, median (IQR)		36.00 (11.00)	35.00 (11.00)	37.00 (10.00)	.02	−0.15	37.00 (11.00)	35.00 (12.00)	.00	0.14
charlson, median (IQR)		7.00 (4.00)	7.00 (4.00)	8.00 (4.00)	.00	−0.19	7.00 (4.00)	7.00 (4.00)	.26	0.06
shock index, median (IQR)		1.25 (0.56)	1.24 (0.55)	1.28 (0.57)	.13	−0.09	1.27 (0.58)	1.23 (0.55)	.05	0.10
apacheii, median (IQR)		20.00 (10.00)	20.00 (9.00)	21.00 (10.00)	.30	−0.07	21.00 (10.00)	20.00 (9.00)	.01	0.14
sirs, n (p%)					.15	0.18			.65	0.08
	0	5 (0.26%)	5 (0.31%)	0 (0.00%)			1 (0.18%)	4 (0.30%)
	1	110 (5.81%)	96 (5.93%)	14 (5.11%)			29 (5.11%)	81 (6.12%)
	2	467 (24.68%)	406 (25.09%)	61 (22.26%)			132 (23.24%)	335 (25.30%)
	3	859 (45.40%)	716 (44.25%)	143 (52.19%)			262 (46.13%)	597 (45.09%)
	4	451 (23.84%)	395 (24.41%)	56 (20.44%)			144 (25.35%)	307 (23.19%)
GCS, n (p%)					.69	0.21			.63	0.16
	3	72 (3.81%)	59 (3.65%)	13 (4.74%)			26 (4.58%)	46 (3.47%)
	4	13 (0.69%)	9 (0.56%)	4 (1.46%)			4 (0.70%)	9 (0.68%)
	5	9 (0.48%)	9 (0.56%)	0 (0.00%)			2 (0.35%)	7 (0.53%)
	6	22 (1.16%)	21 (1.30%)	1 (0.36%)			7 (1.23%)	15 (1.13%)
	7	25 (1.32%)	21 (1.30%)	4 (1.46%)			10 (1.76%)	15 (1.13%)
	8	34 (1.80%)	29 (1.79%)	5 (1.82%)			9 (1.58%)	25 (1.89%)
	9	43 (2.27%)	37 (2.29%)	6 (2.19%)			8 (1.41%)	35 (2.64%)
	10	48 (2.54%)	39 (2.41%)	9 (3.28%)			16 (2.82%)	32 (2.42%)
	11	46 (2.43%)	38 (2.35%)	8 (2.92%)			15 (2.64%)	31 (2.34%)
	12	43 (2.27%)	35 (2.16%)	8 (2.92%)			10 (1.76%)	33 (2.49%)
	13	125 (6.61%)	105 (6.49%)	20 (7.30%)			42 (7.39%)	83 (6.27%)
	14	368 (19.45%)	316 (19.53%)	52 (18.98%)			120 (21.13%)	248 (18.73%)
	15	1044 (55.18%)	900 (55.62%)	144 (52.55%)			299 (52.64%)	745 (56.27%)
spesi score, n (p%)					.71	0.12			.68	0.09
	0	7 (0.37%)	6 (0.37%)	1 (0.36%)			2 (0.35%)	5 (0.38%)
	1	365 (19.29%)	309 (19.10%)	56 (20.44%)			101 (17.78%)	264 (19.94%)
	2	741 (39.16%)	629 (38.88%)	112 (40.88%)			227 (39.96%)	514 (38.82%)
	3	538 (28.44%)	460 (28.43%)	78 (28.47%)			172 (30.28%)	366 (27.64%)
	4	207 (10.94%)	185 (11.43%)	22 (8.03%)			58 (10.21%)	149 (11.25%)
	5	34 (1.80%)	29 (1.79%)	5 (1.82%)			8 (1.41%)	26 (1.96%)
pathos score, n (p%)					.63	0.13			.17	0.14
	0	3 (0.16%)	3 (0.19%)	0 (0.00%)			2 (0.35%)	1 (0.08%)
	1	249 (13.16%)	221 (13.66%)	28 (10.22%)			63 (11.09%)	186 (14.05%)
	2	671 (35.47%)	575 (35.54%)	96 (35.04%)			202 (35.56%)	469 (35.42%)
	3	678 (35.84%)	572 (35.35%)	106 (38.69%)			210 (36.97%)	468 (35.35%)
	4	263 (13.90%)	223 (13.78%)	40 (14.60%)			86 (15.14%)	177 (13.37%)
	5	28 (1.48%)	24 (1.48%)	4 (1.46%)			5 (0.88%)	23 (1.74%)
respiration, n (p%)					.22	0.17			.52	0.09
	0	1321 (69.82%)	1119 (69.16%)	202 (73.72%)			403 (70.95%)	918 (69.34%)
	1	111 (5.87%)	97 (6.00%)	14 (5.11%)			25 (4.40%)	86 (6.50%)
	2	349 (18.45%)	308 (19.04%)	41 (14.96%)			105 (18.49%)	244 (18.43%)
	3	89 (4.70%)	73 (4.51%)	16 (5.84%)			28 (4.93%)	61 (4.61%)
	4	22 (1.16%)	21 (1.30%)	1 (0.36%)			7 (1.23%)	15 (1.13%)
coagulation, n (p%)					.21	0.16			.03	0.17
	0	1263 (66.75%)	1076 (66.50%)	187 (68.25%)			404 (71.13%)	859 (64.88%)
	1	389 (20.56%)	343 (21.20%)	46 (16.79%)			94 (16.55%)	295 (22.28%)
	2	177 (9.36%)	147 (9.09%)	30 (10.95%)			56 (9.86%)	121 (9.14%)
	3	47 (2.48%)	37 (2.29%)	10 (3.65%)			10 (1.76%)	37 (2.79%)
	4	16 (0.85%)	15 (0.93%)	1 (0.36%)			4 (0.70%)	12 (0.91%)
liver, n (p%)					.68	0.09			.01	0.19
	0	1646 (87.00%)	1414 (87.39%)	232 (84.67%)			510 (89.79%)	1136 (85.80%)
	1	99 (5.23%)	83 (5.13%)	16 (5.84%)			19 (3.35%)	80 (6.04%)
	2	99 (5.23%)	83 (5.13%)	16 (5.84%)			20 (3.52%)	79 (5.97%)
	3	27 (1.43%)	21 (1.30%)	6 (2.19%)			11 (1.94%)	16 (1.21%)
	4	21 (1.11%)	17 (1.05%)	4 (1.46%)			8 (1.41%)	13 (0.98%)
cardiovascular, n (p%)					.36	0.16			.40	0.10
	0	569 (30.07%)	484 (29.91%)	85 (31.02%)			175 (30.81%)	394 (29.76%)
	1	856 (45.24%)	736 (45.49%)	120 (43.80%)			249 (43.84%)	607 (45.85%)
	2	12 (0.63%)	12 (0.74%)	0 (0.00%)			6 (1.06%)	6 (0.45%)
	3	219 (11.58%)	191 (11.80%)	28 (10.22%)			61 (10.74%)	158 (11.93%)
	4	236 (12.47%)	195 (12.05%)	41 (14.96%)			77 (13.56%)	159 (12.01%)
aki stage, n (p%)					<.001	0.25			.94	0.02
	1	349 (18.45%)	315 (19.47%)	34 (12.41%)			103 (18.13%)	246 (18.58%)
	2	855 (45.19%)	739 (45.67%)	116 (42.34%)			260 (45.77%)	595 (44.94%)
	3	688 (36.36%)	564 (34.86%)	124 (45.26%)			205 (36.09%)	483 (36.48%)
history thrombosis, n (p%)					<.001	2.10			.52	0.03
	0	1665 (88.00%)	1587 (98.08%)	78 (28.47%)			504 (88.73%)	1161 (87.69%)
	1	227 (12.00%)	31 (1.92%)	196 (71.53%)			64 (11.27%)	163 (12.31%)

Note: A P value less than .05 indicates a statistically significant difference between groups, and a standardized mean difference (SMD) greater than 0.1 indicates an imbalance between groups. These have been highlighted in bold in the table.

After dividing the dataset into training and test sets, the distributions of the variables in both sets were statistically compared to assess the baseline consistency. For continuous variables, the Henze–Zirkler test was initially applied to determine normality (Attachment 3); if the variables did not follow a normal distribution, the rank-sum test was used for group comparisons. The significance threshold was set at P < .05. This analytical approach has been extensively adopted in machine learning modeling studies and effectively ensures the consistency of variable distributions between the training and test sets, thereby enhancing the scientific rigor and robustness of model construction.

Outliers in the training set were detected and removed using the boxplot method (Attachment 4). To address the issue of imbalanced sample categories, the SMOTE technique was used to oversample minority classes and improve the balance of the training data.²⁶
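The core step of SMOTE is interpolation between a minority-class sample and one of its nearest minority-class neighbors. The pure-Python sketch below is a simplified illustration of that step, not the implementation used in the study; the function name `smote_sketch`, the parameter defaults, and the sample points are hypothetical. Plain interpolation of this kind assumes continuous features.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature minority samples and a majority class of size 10
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
n_majority = 10
synthetic = smote_sketch(minority, n_new=n_majority - len(minority), k=3)
print(len(minority) + len(synthetic), "minority samples after oversampling")
```

By the same arithmetic, balancing the training set from 1125:199 to 1125:1125 corresponds to 926 synthetic minority samples.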

For feature selection, this study integrated the decision tree model with the SHAP method to rank variable importance and ultimately identified six key predictors strongly associated with thrombosis: history of prior thrombosis, GCS score, shock index, SIRS score, SPESI score, and total duration of mechanical ventilation.

Based on these features, nine machine learning models were developed to predict the thrombosis risk, with the area under the receiver operating characteristic curve (AUROC) serving as the performance evaluation metric. Three models demonstrating superior performance were selected for the bootstrap validation phase: RF, LightGBM, and CatBoost. Finally, based on the bootstrap validation results, the performances of the models were comprehensively assessed across six dimensions: F1-score, recall, MCC, AUROC, specificity, and calibration curve, leading to the selection of CatBoost as the final clinical prediction model for this study. Additionally, the Brier Score, Hosmer–Lemeshow test, and calibration curve were calculated to evaluate the calibration performance.
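The discrimination and calibration measures named above can be sketched without any external library. The code below computes the AUROC as the probability that a randomly chosen event receives a higher score than a randomly chosen non-event, the Brier score as the mean squared error of the predicted probabilities, and a percentile bootstrap confidence interval for the AUROC. The labels and probabilities are hypothetical, and the number of resamples and the percentile method are assumptions, since the study does not specify them.

```python
import random


def auroc(y_true, y_score):
    """Share of (event, non-event) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI; resamples lacking one class are redrawn."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = [y_true[i] for i in idx]
        if len(set(resampled)) < 2:
            continue
        stats.append(auroc(resampled, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical outcomes and predicted probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.15, 0.90]
auc_value = auroc(y_true, y_prob)
brier = brier_score(y_true, y_prob)
ci_low, ci_high = bootstrap_auroc_ci(y_true, y_prob)
print(round(auc_value, 3), round(brier, 5), (ci_low, ci_high))
```

A lower Brier score indicates better agreement between predicted probabilities and observed outcomes, complementing the rank-based AUROC.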

Results

Sample Size

The minimum required sample size was 1600. The final dataset comprised 1892 cases, meeting the EPV ≥ 20 standard, which can effectively support the development and validation of machine learning prediction models with good generalization ability.¹⁷ The external validation set consisted of 200 independent cases.

Dataset Description and Division

This study used the MIMIC-IV database,¹⁵ which includes clinical data from 1892 ICU patients, to develop a machine learning dataset for predicting thrombosis events. Among these patients, 274 (14.48%) experienced thrombotic events (thrombosis = 1), whereas 1618 (85.52%) did not (thrombosis = 0). To assess the generalization capability of the model and prevent overfitting, the dataset was randomly divided into a training set (N = 1324) and a test set (N = 568) in a 7:3 ratio. The thrombotic event rates in both sets were comparable (15.03% in the training set vs 13.20% in the test set), meeting the stability criteria required for model validation. Baseline characteristics included demographic details (sex, age), clinical variables (preoperative comorbidities, laboratory indicators), and ICU scoring scales. The distribution of these variables was assessed using the SMD, with a median P > .05 or SMD < 0.1, indicating satisfactory balance among variables.³⁶ Predictive factors were identified using a decision tree algorithm, and variables with statistical significance but limited clinical relevance were excluded (eg, SAPS II score: 46.0 in the training set vs 43.0 in the test set, P = .00, SMD = 0.15). A detailed description of the data division process and feature selection strategy is provided in the research design framework (Figure 1), which adheres to the methodological guidelines outlined in the “Clinical Prediction Model Development Guide”.¹⁸ The external validation set from our hospital consisted of 200 patients, with 43 (21.5%) thrombotic events. Its baseline characteristics are shown in Table 2.

Figure 2.

Order of importance of predictive factors.

Table 2.

Descriptive Statistics of the Validation Set.

Variable Names	Level	Overall	0	1	p	SMD
n		200	157	43
cancer		30 (1-71)	59 (1-71)	1 (1-69)	.723	0.061
age		63.5 (51-77)	62 (50-76)	67 (56-79.5)	.363	0.158
Systolic_blood_pressure		109.5 (104-119)	110 (104-120)	108 (105-115.5)	.232	0.222
T		37.7 (36.975-38.125)	37.7 (37-38.1)	37.7 (36.9-38.2)	.882	0.025
HR		127.5 (87-174.25)	128 (83-178)	126 (91-160)	.797	0.046
WBC		28 (13.15-44)	37 (13.2-44)	16.8 (12.95-42.5)	.579	0.097
SPCO2		31 (19-44.25)	26 (19-44)	38 (18-44.5)	.768	0.051
R		22 (19-23)	22 (19-23)	22 (19-23)	.253	0.204
gcs		10.5 (3-15)	10 (3-15)	11 (3-15)	.763	0.052
shock_index		1.141 (0.768-1.58)	1.144 (0.761-1.628)	1.074 (0.813-1.372)	.507	0.122
ventilation_hour		47.7 (23.773-138.562)	46.33 (23-135.75)	81.5 (25.22-141.375)	.938	0.014
Gender (%)	1	149 (74.50)	113 (71.97)	36 (83.72)	.171	0.286
Gender (%)	2	51 (25.50)	44 (28.03)	7 (16.28)
Chronic_lung_disease (%)	0	133 (66.50)	100 (63.69)	33 (76.74)	.154	0.288
Chronic_lung_disease (%)	1	67 (33.50)	57 (36.31)	10 (23.26)
heart_disease (%)	0	121 (60.50)	92 (58.60)	29 (67.44)	.382	0.184
heart_disease (%)	1	79 (39.50)	65 (41.40)	14 (32.56)
sirs (%)	0	9 (4.50)	7 (4.46)	2 (4.65)	.543	0.297
sirs (%)	1	56 (28.00)	45 (28.66)	11 (25.58)
sirs (%)	2	54 (27.00)	38 (24.20)	16 (37.21)
sirs (%)	3	34 (17.00)	28 (17.83)	6 (13.95)
sirs (%)	4	47 (23.50)	39 (24.84)	8 (18.60)
spesi_score (%)	2	57 (28.50)	45 (28.66)	12 (27.91)	.346	0.299
spesi_score (%)	3	101 (50.50)	83 (52.87)	18 (41.86)
spesi_score (%)	4	37 (18.50)	26 (16.56)	11 (25.58)
spesi_score (%)	5	5 (2.50)	3 (1.91)	2 (4.65)
history_thrombosis (%)	0	166 (83.00)	153 (97.45)	13 (30.23)	<.001	1.958
history_thrombosis (%)	1	34 (17.00)	4 (2.55)	30 (69.77)

Table 3.

Comparison of Performance Parameters for Multiple Models on the Test Set.

Model Name	AUROC	MCC	F1-Score	Recall	Specificity
CatBoost	0.890	0.481	0.541	0.787	0.830
LGBM	0.889	0.481	0.545	0.76	0.844
RF	0.886	0.609	0.659	0.787	0.909
XGB	0.885	0.505	0.560	0.813	0.834
Decision tree	0.881	0.355	0.418	0.853	0.661
Logistic	0.864	0.570	0.626	0.76	0.899
GBDT	0.789	0.395	0.458	0.827	0.728
NB	0.787	0.355	0.429	0.787	0.714
AdaBoost	0.761	0.371	0.441	0.8	0.722

Explanation of Predictive Factors

Prior thrombosis history: This criterion evaluates the risk of recurrent thrombotic events based on the presence of thrombosis at the time of admission.¹

Glasgow Coma Scale: A widely used tool for assessing the level of consciousness, particularly in neurological diseases, head injuries, and postoperative consciousness disorders. The total score ranges from 3 to 15 and combines eye opening, verbal response, and motor response. Scores of 13–15 indicate normal consciousness or mild impairment, 9–12 moderate impairment, and ≤ 8 severe impairment.²¹

Shock index: Calculated as the ratio of heart rate to systolic blood pressure (SBP), which assesses circulatory perfusion status. A value > 0.9 suggests potential circulatory dysfunction, whereas ≥ 1.0 is commonly associated with hemorrhagic or septic shock.²⁵

SIRS score: A clinical criterion for identifying systemic inflammatory responses, incorporating abnormal body temperature (> 38 °C or < 36 °C), heart rate > 90 beats/min, respiratory rate > 20 breaths/min or partial pressure of carbon dioxide < 32 mm Hg, and white blood cell count > 12 000/mm³ or < 4000/mm³, or immature neutrophil count > 10%. Meeting two or more criteria indicates an SIRS state, often used for the early detection of sepsis or severe infections.²²

SPESI score: A simplified tool for evaluating PE severity and predicting short-term mortality risk. Criteria include age > 80 years, cancer history, chronic heart/lung disease, SBP < 100 mm Hg, respiratory rate ≥ 30 breaths/min, heart rate ≥ 110 beats/min, and arterial oxygen saturation < 90%. A total score of 0 indicates low risk, whereas ≥ 1 indicates high risk.

Mechanical ventilation duration: The cumulative time (hours or days) for which a patient receives mechanical ventilation support is clinically significant in the intensive care setting. Extended ventilation periods are associated with increased risks of ventilator-associated pneumonia, airway injury, ICU-acquired weakness, and delirium.
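Three of the definitions above reduce to simple arithmetic, so a minimal Python sketch may make them concrete. The function and parameter names are illustrative assumptions rather than the authors' code; the cut-offs are those stated in the text, and the immature neutrophil percentage defaults to 0 when it is not measured (also an assumption).

```python
def shock_index(heart_rate, systolic_bp):
    """Shock index = heart rate (beats/min) / systolic blood pressure (mm Hg)."""
    if systolic_bp <= 0:
        raise ValueError("systolic_bp must be positive")
    return heart_rate / systolic_bp


def sirs_score(temp_c, heart_rate, resp_rate, paco2_mmhg, wbc_per_mm3,
               immature_pct=0.0):
    """Count how many of the four SIRS criteria are met (0 to 4)."""
    criteria = [
        temp_c > 38.0 or temp_c < 36.0,                  # abnormal temperature
        heart_rate > 90,                                 # tachycardia
        resp_rate > 20 or paco2_mmhg < 32,               # tachypnea or low PaCO2
        wbc_per_mm3 > 12000 or wbc_per_mm3 < 4000 or immature_pct > 10,
    ]
    return sum(criteria)


def gcs_category(gcs_total):
    """Map a total Glasgow Coma Scale score (3 to 15) to a severity band."""
    if not 3 <= gcs_total <= 15:
        raise ValueError("GCS total must be between 3 and 15")
    if gcs_total >= 13:
        return "normal or mild"
    if gcs_total >= 9:
        return "moderate"
    return "severe"


# Hypothetical patient values, chosen only for illustration.
print(round(shock_index(128, 110), 3))
print(sirs_score(37.7, 128, 22, 26, 37000))
print(gcs_category(10))
```

In this hypothetical case, a heart rate of 128 beats/min with an SBP of 110 mm Hg gives a shock index of about 1.16, three SIRS criteria are met, and a GCS of 10 falls in the moderate band.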

Model Performance Comparison

Because overfitting or underfitting occurred on the training set among the nine machine learning models compared in this study, the final results were based on the test set. CatBoost, LightGBM, and RF performed best overall, with test-set AUC values approaching 0.89, demonstrating excellent discrimination. CatBoost ranked first with an AUC of 0.890; it is particularly suited to high-dimensional categorical variables and complex data structures and combined high accuracy with strong robustness. LightGBM followed closely (AUC = 0.889); because of its efficient training speed and excellent generalization performance, it has been widely applied in both industry and academia. RF (AUC = 0.886), a classic ensemble model, also performed stably in capturing nonlinear features and preventing overfitting. Notably, although the decision tree was the most basic model, its AUC reached 0.881, approaching that of the mainstream ensemble models. This suggests that the variables in this dataset provided relatively clear decision boundaries, which enhanced the discrimination ability of the decision tree. Although logistic regression, a traditional model, has limitations in modeling nonlinear relationships, its AUC still reached 0.864, demonstrating its robustness and interpretability as a baseline model in scenarios where features are linearly separable.³⁷ Among the gradient boosting methods, the AUC of XGBoost (0.885) was slightly lower than that of LightGBM, whereas the gradient-boosted decision tree (GBDT; AUC = 0.789) lagged considerably, which might be attributed to differences in parameter optimization strategies or underlying implementation mechanisms.
Additionally, AdaBoost (AUC = 0.761) and naive Bayes (AUC = 0.787) showed relatively weak adaptability and generalization abilities in this study, indicating certain limitations in handling high-dimensional heterogeneous data (Figure 3, Table 3). Therefore, CatBoost, LightGBM, and RF, with their outstanding AUC performances and robustness, entered the bootstrap validation stage for thrombosis prediction.
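The metrics reported in Table 3 can all be derived from a confusion matrix and the ranked predicted scores. The sketch below is a plain-Python illustration with hypothetical helper names (`classification_metrics`, `auroc`), not the authors' pipeline. AUROC is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half.

```python
from math import sqrt


def classification_metrics(y_true, y_pred):
    """Recall, specificity, F1-score, and MCC from binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "specificity": specificity, "f1": f1, "mcc": mcc}


def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only (not study data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(classification_metrics(y_true, y_pred))
print(auroc(y_true, scores))
```

Because AUROC is threshold-free whereas MCC, F1-score, recall, and specificity depend on the chosen cut-off, the two kinds of metric can rank models differently, as Table 3 shows for CatBoost (highest AUROC) and RF (highest MCC and F1-score).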

Figure 3.

Comparison of ROC curves between the training set and the test set for the nine models.

Validation Results of the Best Model

Based on the bootstrap validation results of the three models (CatBoost, light gradient boosting machine [LGBM], and RF), a comprehensive analysis of AUC, calibration performance, and stability was performed, leading to the following conclusions. In terms of AUC, LGBM performed best, with a range of 0.841–0.864 and a median of 0.853, showing no abnormal fluctuations and indicating robust discrimination. Compared with CatBoost (median AUC, 0.85; range, 0.838-0.862), LGBM had a slight advantage. However, the median AUC of RF was only 0.799 (range, 0.786-0.812), which was lower than that of the other two models. In terms of calibration, the calibration error of CatBoost (0.051-0.059) was lower than that of LGBM (0.057-0.071), indicating a high consistency between the predicted probabilities and the actual observed proportions. Additionally, CatBoost demonstrated a good balance between high sensitivity (0.979) and moderate specificity (0.717) at the optimal threshold (0.276), making it suitable for clinical scenarios requiring early risk warning. Although LGBM showed the highest AUC in a single test (BR3TEST = 0.878), its performance was unstable in some test sets (eg, BR9TEST = 0.845), and it had a relatively high calibration error, which might cause the predicted probabilities to deviate from the actual observed values. RF, although it had an outstanding AUC in some test sets (BR3TEST = 0.887), had severe outliers (BR6TEST = 0.527, BR7TEST = 0.509), indicating insufficient stability. In summary, CatBoost was selected as the core model because of its high discrimination ability, precise calibration, and good generalization performance; its overall performance was superior to that of LGBM and RF (Figure 4).
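The median and range of the AUC across bootstrap rounds, as summarized above, can be obtained along the following lines. This is a minimal sketch: the exact resampling scheme used in the study is not described in this section, so the code assumes ordinary resampling of a fixed evaluation set with replacement, and `bootstrap_auc` is a hypothetical helper name.

```python
import random
from statistics import median


def auroc(y_true, scores):
    """Pairwise AUROC; ties between a positive and a negative count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(y_true, scores, n_rounds=200, seed=0):
    """Return (min, median, max) AUROC over bootstrap resamples."""
    n = len(y_true)
    if sum(y_true) in (0, n):
        raise ValueError("both classes are required")
    rng = random.Random(seed)  # fixed seed keeps the summary reproducible
    aucs = []
    while len(aucs) < n_rounds:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one class
            aucs.append(auroc(ys, [scores[i] for i in idx]))
    return min(aucs), median(aucs), max(aucs)


# Synthetic labels and scores for illustration only (not study data).
y = [1] * 10 + [0] * 30
s = [0.5 + 0.04 * i for i in range(10)] + [0.02 * i for i in range(30)]
print(bootstrap_auc(y, s, n_rounds=50, seed=1))
```

A narrow spread between the minimum and maximum suggests stable discrimination, whereas isolated low values of the kind reported for RF would show up as a much lower minimum.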

Figure 4.

Comparison of the performance of three models.

Model Calibration

Model calibration enhances decision reliability by aligning predicted probabilities with actual observed probabilities. Prior to calibration, the Brier score of the CatBoost model was 0.054, and the calibration curve deviated from the diagonal line, indicating a systematic probability bias. Following calibration using the Platt Scaling method, the Brier score decreased to 0.053, and the calibration curve closely approximated the ideal diagonal line (Figure 5), confirming the improved probability output reliability. The Hosmer–Lemeshow test (P > .05) validated the good fit of the calibrated model, whereas the uncalibrated model failed this test, signifying inadequate calibration performance.³⁸,³⁹
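To illustrate the two quantities discussed above, the following sketch computes the Brier score and fits a Platt-style sigmoid, p = sigmoid(a·s + b), by gradient descent on the log loss. It is an illustration under stated assumptions rather than the authors' implementation: the raw scores are taken to be log-odds, the learning rate and iteration count are arbitrary, and the toy data are invented. In practice, a library routine such as scikit-learn's `CalibratedClassifierCV` with `method="sigmoid"` is a common way to apply Platt scaling.

```python
from math import exp, log


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def log_loss(y_true, probs, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * log(p) + (1 - y) * log(1.0 - p)
    return -total / len(y_true)


def fit_platt(y_true, raw_scores, lr=0.1, n_iter=2000):
    """Fit slope a and intercept b of sigmoid(a * s + b) on the log loss."""
    a, b = 1.0, 0.0  # start from the identity map on the log-odds scale
    n = len(y_true)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for y, s in zip(y_true, raw_scores):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy, overconfident log-odds scores (illustration only).
y = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
raw = [-3.0, -2.5, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 2.5, 3.0]
uncalibrated = [sigmoid(s) for s in raw]
a, b = fit_platt(y, raw)
calibrated = [sigmoid(a * s + b) for s in raw]
print(brier_score(y, uncalibrated), brier_score(y, calibrated))
```

Because the fitted map is monotonic in the raw score whenever the slope is positive, it changes the predicted probabilities without reordering patients, which is why discrimination (AUC) is unaffected by this kind of calibration.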

Figure 5.

Comparison before and after calibration, and in the validation set for CatBoost.

Although calibration did not alter the model's discrimination ability (AUC = 0.855; 95% CI, 0.797-0.913), the optimal decision threshold decreased from 0.397 to 0.245, potentially reducing the likelihood of missed high-risk diagnoses while preserving sensitivity (0.971) and specificity (0.753). Decision curve analysis (DCA) revealed that the calibrated model yielded a higher net benefit than conventional strategies within the risk threshold range of 0.25–0.75 (Figure 5), underscoring its clinical utility. However, the net benefit diminished in the high-threshold range (> 0.75), suggesting limited discrimination ability or a risk of overestimating probabilities in high-risk areas.⁴⁰ Such phenomena may be associated with local data sparsity or the limitations of the calibration method, emphasizing the importance of evaluating model practicality in conjunction with clinical decision thresholds to avoid masking performance deficiencies in critical intervals using global metrics.
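The net benefit compared in a decision curve is conventionally defined as TP/n − (FP/n) × pt/(1 − pt), where pt is the risk threshold. A minimal sketch of this standard formula, together with the "treat all" reference strategy, is given below; the helper names and toy data are illustrative assumptions and do not reproduce the study's curves.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit when every patient is treated regardless of risk."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy data for illustration only; "treat none" always has net benefit 0.
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
p = [0.9, 0.6, 0.7, 0.1, 0.2, 0.1, 0.05, 0.3, 0.2, 0.1]
for pt in (0.25, 0.5, 0.75):
    print(pt, net_benefit(y, p, pt), net_benefit_treat_all(y, pt))
```

A model is useful at a given threshold only if its net benefit exceeds both reference strategies there, which is why the curve is read over a range of thresholds rather than at a single point.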

External Validation Results

To evaluate the model's generalizability, we performed external validation using an independent dataset of 200 ICU cancer patients from our hospital. This external validation set had similar demographic and clinical characteristics but a slightly higher thrombosis incidence (21.5% vs 14.48%), reflecting real-world data diversity.

The model demonstrated good discriminative ability in the external validation set, with an AUC of 0.83 (95% confidence interval: 0.742-0.918), slightly lower than the internal validation result but still significantly better than traditional scoring systems (eg, Caprini, Padua).⁸ At the optimal threshold of 0.307, the model's sensitivity was 0.968 and specificity was 0.698, indicating strong ability to identify thrombotic events, though specificity decreased, possibly due to differences in external data distribution.

The calibration curve showed good consistency between the model's predicted probabilities and actual observed probabilities, with a calibration error of 0.078 (Figure 5). Compared to internal validation, the calibration error increased slightly but remained acceptable, suggesting the model maintained good calibration performance across different institutions. Decision curve analysis (DCA) further verified the model's clinical utility: within the risk threshold range of 0.25–0.75, the model's net benefit was significantly higher than the “treat all” or “treat none” strategies (Figure 5), but net benefit decreased at high-risk thresholds (>0.75), consistent with internal validation results.

The external validation results overall support the robustness and applicability of the CatBoost model in real clinical environments, although there may be room for optimization in specificity and calibration in high-risk intervals. Future work could involve fine-tuning model parameters with multi-center data or adopting adaptive threshold strategies to improve performance.

Visualization Analysis and Clinical Decision Support

High Discriminative Power and Calibration Consistency

The CatBoost calibration model exhibited exceptional discriminative ability (AUC = 0.855; 95% CI, 0.797-0.913) in predicting thrombotic events among critically ill patients with malignant tumors, surpassing traditional scoring systems, such as Caprini, Padua, and Khorana.⁷ At the optimal threshold of 0.245, the model achieved a favorable balance between sensitivity (0.971) and specificity (0.753), making it suitable for early screening applications. The calibrated predicted probabilities were closely aligned with the actual outcomes (Brier score = 0.053), and DCA confirmed the net benefit advantage within the risk threshold range of 0.245–0.75 (Figure 5). With the calibrated threshold set at 0.245, for patients with a thrombosis risk below this value, the model minimizes unnecessary interventions in low-risk populations (as indicated by DCA showing negative or negligible net benefits in this range); for those with a risk ≥ 0.245, intervention is recommended if no bleeding contraindications exist (positive DCA net benefit and higher-than-predicted actual thrombosis probability in high-risk groups). Notably, when the risk exceeds 0.75, despite the slower growth rate of the DCA net benefit, priority intervention remains essential because of the model's conservative prediction of high-risk scenarios, unless clear bleeding contraindications are present.⁴¹

Model Interpretation Based on SHapley Additive exPlanations Plots

Global Feature Priority

The SHAP importance plot highlights that history_thrombosis (history of thrombosis) is the most influential feature (longest bar), followed by ventilation_hour (mechanical ventilation duration); together, these two features drive the model's prediction of thrombosis risk (Figure 6). This agrees with their prominent distributions in the beeswarm plot (Figure 7) at the population level (purple/yellow points representing high/low feature values) and with their contributions in the force plot (Figure 8) for individual samples (eg, negative SHAP values for no history of thrombosis and reduced risk from prolonged ventilation). Taken together, the three plots form a coherent chain of interpretation that links global, population-level, and individual insights.

Figure 6.

SHAP importance plot.

Figure 7.

SHAP beeswarm plot.

Figure 8.

SHAP heat force plot.

Population Feature: Risk Distribution

For history_thrombosis: A positive history (purple) corresponds to a positive SHAP value (increased risk), whereas no history (yellow) results in a negative SHAP value (reduced risk), consistent with the negative contribution observed in the force plot for samples without a history (eg, f(x) = –5.86, below the baseline). This reinforces the population-level rule that “no history of thrombosis” serves as a low-risk marker.

For ventilation_hour: Longer ventilation times (purple) yield negative SHAP values (reduced risk), whereas shorter durations (yellow) produce positive SHAP values (increased risk). This explains the negative impact of extended ventilation observed in the force plot (eg, 170 h lowering the predicted value) and underscores the need for heightened vigilance regarding thrombosis in patients with insufficient early postoperative ventilation.

Other features (eg, spesi_score, gcs): Higher scores (purple) correspond to positive SHAP values (higher risk, eg, elevated SPESI scores indicating severe PE and increased thrombosis likelihood), whereas lower scores (yellow) result in negative SHAP values (lower risk, eg, normal GCS score reflecting stable circulation and reduced risk²¹). These findings complement single-sample analyses in force plots and elucidate feature-risk associations at the population level (Figure 7).

Individual Feature Contribution Decomposition

Using a specific sample as an example, the force plot visually demonstrates how core features (history_thrombosis = 0, ventilation_hour = 170) dominate the low-risk prediction through substantial negative SHAP values (–0.907, −0.77), consistent with the distribution of no history and long ventilation in the beeswarm plot. Auxiliary features (shock_index, GCS) contributed smaller negative adjustments, reflecting the model's comprehensive evaluation of multidimensional clinical indicators (eg, modest risk reduction associated with a shock index of 1.28, requiring clinical judgment of the complex relationship between shock and thrombosis) (Figure 8).
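The decomposition shown in a force plot rests on the additive property of SHAP values: the model output for one patient equals a baseline (expected) value plus the sum of that patient's feature contributions. The sketch below illustrates this under explicit assumptions: the output is taken to be on the log-odds scale (a common choice for tree-based classifiers, but not stated in the text), the baseline of −2.0 is hypothetical, and only the two SHAP values quoted above (−0.907 and −0.77) come from the article.

```python
from math import exp


def explain_prediction(base_value, shap_values):
    """Sum baseline and SHAP contributions, then map log-odds to probability."""
    log_odds = base_value + sum(shap_values.values())
    probability = 1.0 / (1.0 + exp(-log_odds))
    return log_odds, probability


# Contributions for the two core features quoted in the text; the baseline
# is a hypothetical placeholder, and other features are omitted for brevity.
contributions = {
    "history_thrombosis": -0.907,
    "ventilation_hour": -0.77,
}
log_odds, prob = explain_prediction(base_value=-2.0, shap_values=contributions)
print(round(log_odds, 3), round(prob, 4))
```

Negative contributions push the output below the baseline and positive ones push it above, which is exactly what the left-pointing and right-pointing bars of a force plot depict.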

Model Interpretation and Clinical Application

Risk stratification:

− High risk: Presence of thrombosis history (history_thrombosis = 1), short ventilation time (ventilation_hour short, yellow value), and elevated SPESI/SIRS scores (purple value) → Intensify anticoagulation measures (eg, increase dose, shorten monitoring intervals) to prevent thrombosis recurrence or progression.

− Low risk: Absence of thrombosis history (history_thrombosis = 0), prolonged ventilation (ventilation_hour long, purple value), and normal consciousness (GCS score = 15, purple value) → Reduce anticoagulation intensity (eg, decrease dose, extend monitoring intervals) to minimize bleeding complications.

Enhanced explainability:

The three SHAP plots progress from global feature ranking (importance plot) to group feature-risk distribution (beeswarm plot) to individual feature contribution (force plot), enabling clinicians to clearly understand the following:

− Why a patient is considered high-risk (eg, “thrombosis history + short ventilation” → accumulation of positive SHAP values increases risk).

− Why a patient is deemed low-risk (eg, “no history + long ventilation” → dominance of negative SHAP values reduces risk).

This interpretability strengthens trust in the artificial intelligence (AI) model and facilitates data-driven and precise anticoagulation strategies in clinical practice, thereby balancing treatment efficacy and safety. Through the collaborative interpretation of the three SHAP plots, the model provides an explainable thrombosis risk analysis from the global to the individual level. Core features (history, ventilation time) guide risk assessment, and auxiliary features refine predictions, offering a robust framework for clinical workflows: screening (importance plot), group management (beeswarm plot), and personalized decision-making (force plot). This not only validates the model's clinical utility but also integrates AI predictions with clinical expertise via visualized feature contributions, optimizing thrombosis risk management and improving patient outcomes.

Web Calculator Implementation

As shown in Figure 9, this model is embedded in a web-based application (https://fast.statsape.com/tool/detail?id=15), where inputting the values or scores of the six predictive factors will generate the probability of thrombosis. Currently, the validity period of this model is one year.

Figure 9.

Web calculator.

Discussion

Limitations of the Study Population and Data Source

The generalizability of our findings is constrained by the characteristics of our study population and data source. First, the model was developed and validated using data from adults (≥18 years), excluding adolescents and patients with pregnancy-associated tumors. Given the substantial differences in treatment modalities, hemodynamic profiles, and coagulation mechanisms, the VTE risk pathways in pediatric patients are fundamentally distinct from those in adults, precluding direct extrapolation. Similarly, pregnancy constitutes a high-thrombotic-risk state, which is further amplified by co-occurring malignancy.⁴² Predictive modeling for these populations must incorporate their specific pathophysiological changes. Second, our reliance on the single-center MIMIC-IV database from the United States limits the racial and regional diversity of our dataset. Known racial disparities in VTE incidence and genetic predispositions—such as the higher risk observed in African Americans and the relatively lower risk in Asian populations—mean that the absence of race-specific subgroup analyses may affect model performance across different ethnic groups.⁴³ Therefore, future studies focusing on these specific populations and employing large-scale, multi-center, multi-ethnic datasets for external validation are essential to assess and improve the model's broad applicability.⁷

Notwithstanding these limitations, this study took a critical step toward clinical translation by successfully validating the CatBoost model using an independent dataset of 200 patients from our institution. Although the sample size of the external validation set limits the precision of performance estimates—as indicated by wider confidence intervals—the 43 observed thrombotic events provide a clinically and statistically meaningful basis for evaluation. The evidence obtained supports the satisfactory discriminative ability and calibration of the model within our institutional setting, offering a pragmatic foundation for considering its local implementation. Future efforts to aggregate data across multiple centers of comparable scale will help refine performance estimates and enhance the model's broader applicability.

Model Technical Limitations and Future Development Directions

Our model possesses certain technical limitations that point to clear directions for future development. Firstly, it relies heavily on static clinical variables, such as prior thrombosis history and scores at ICU admission. While these features are critical, they fail to capture dynamic changes during the ICU stay and are prone to underreporting, limiting the model's capacity to track evolving risks. Secondly, despite including key variables, the model does not explicitly account for potential interaction effects, such as those between systemic inflammatory response (SIRS) and pulmonary embolism severity (SPESI), which may synergistically modulate thrombosis risk. The current univariate contribution modeling can underestimate the clinical significance of such interactions.

To address these limitations, future research will focus on two key methodological extensions:

Integration of Dynamic Modeling: Incorporating time-series data (eg, continuous vital signs, laboratory results) using architectures like recurrent neural networks (RNNs) or Transformers to construct a dynamic and adaptive framework for real-time risk prediction.⁴⁴,⁴⁵

Enhanced Interaction Modeling: Explicitly modeling interaction terms (eg, between APACHE II score and thrombosis history,⁴⁶ or between mechanical ventilation duration and shock index⁴⁷) or employing algorithms adept at capturing feature interactions, such as factorization machines, to better capture underlying clinical complexity and improve predictive power.

The adopted CatBoost framework offers a robust foundation for these enhancements, as it inherently handles complex feature interactions and nonlinear relationships while remaining robust to multicollinearity.⁴⁸ This robustness allows us to prioritize clinical relevance over strict statistical independence during feature engineering, thereby preserving potentially meaningful variables without compromising model stability.⁴⁹ Combined with post-hoc interpretability methods like SHAP, this approach ensures a balance between high predictive performance and model interpretability.²⁸

Generalizability of Validation and Decision Thresholds

The external validation process revealed important considerations regarding model generalizability. The external validation with 200 cases from our hospital confirmed the model's effectiveness in an independent dataset, but the slight decrease in AUC and calibration performance suggests the model might be affected by data distribution shifts or local treatment patterns.¹³ This emphasizes the necessity for continuous validation and model updates in multi-center environments.

A related issue is the limited generalizability of the model's fixed decision threshold (0.245), which was optimized internally via the Youden index, as evidenced by the different optimal threshold (0.307) identified during external validation. In multi-center applications, such threshold instability could impair performance owing to population and procedural heterogeneity.³⁸ Therefore, future work must employ multi-center external validation and develop adaptive thresholding methods to ensure robust clinical utility.
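The threshold instability described above follows from how a Youden-optimal cut-off is chosen: it maximizes J = sensitivity + specificity − 1 on whichever dataset is at hand, so a different cohort can yield a different optimum. A minimal sketch of this selection is shown below; `youden_threshold` is a hypothetical helper, and the toy data are not study data.

```python
def youden_threshold(y_true, probs):
    """Return (threshold, J) maximizing sensitivity + specificity - 1."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    best_t, best_j = None, -1.0
    for t in sorted(set(probs)):  # each observed score is a candidate cut-off
        tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, probs) if y == 0 and p < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy predicted probabilities for illustration only.
y = [0, 0, 0, 1, 0, 1, 1]
p = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.8]
print(youden_threshold(y, p))
```

Re-running the same procedure on an external cohort, as was done here, is a simple way to check how far the locally optimal threshold drifts from the one fixed during development.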

It is particularly important to note that the 200-patient external validation conducted at our institution is only a preliminary verification. Therefore, a cautious approach should be adopted when applying these findings to clinical scenarios that differ from those of this study.

Clinical Translation and Concluding Outlook

The ultimate goal of this predictive model is to serve clinical practice.⁵⁰ A feasible integration path is to embed it as a plugin within the hospital's electronic medical record (EMR) system, providing automated risk calculation and decision support without significantly increasing clinical workload. However, wide adoption faces challenges, including technical integration with heterogeneous hospital information systems, the critical need to cultivate clinicians' trust in the AI's decision-making logic,¹⁰ and navigating evolving regulatory approval requirements for clinical decision support software.¹⁰

Looking ahead, we plan to confirm the model's universality through large-scale multi-center external validation,⁵¹ a recognized best practice for validating clinical AI. Concurrently, we will leverage explainable AI technologies like SHAP to enhance clinical understanding and trust, gradually promoting the smooth transition of this tool from research to clinical practice.⁵² By integrating dynamic features and interaction effects as outlined previously, we anticipate a significant enhancement in the model's precision, clinical utility, and potential for personalized risk assessment and intervention strategies.⁵⁰

Finally, while machine learning provides the opportunity to analyze highly complex datasets and develop sophisticated models with greater predictive accuracy, it is imperative to underscore that key decisions throughout the analytical process—from study design and feature selection to model interpretation—must remain the responsibility of the researchers. This process should be guided by a strong conceptual framework and deep clinical expertise. Such a framework not only allows for a meaningful understanding of the model's findings but also ensures a clear recognition of its limitations, thereby safeguarding against the over-interpretation of results and promoting the responsible integration of artificial intelligence into clinical practice.

Conclusion

This study developed and validated a CatBoost-based machine learning model to predict thrombosis risk in critically ill cancer patients. The model exhibited high discriminative ability, favorable calibration, and clinical interpretability in both internal and external validation datasets. The results of the preliminary external validation support the model's potential real-world applicability, offering a promising tool for individualized thrombosis risk assessment. Future large-scale, multi-center studies are essential to confirm these findings and solidify the model's generalizability before broad clinical adoption.

Supplemental Material

sj-xlsx-1-cat-10.1177_10760296251408357 - Supplemental material for CatBoost Machine Learning Model for Thrombosis Risk Prediction in Critically Ill Cancer Patients: A MIMIC-IV Database Study

Supplemental material, sj-xlsx-1-cat-10.1177_10760296251408357 for CatBoost Machine Learning Model for Thrombosis Risk Prediction in Critically Ill Cancer Patients: A MIMIC-IV Database Study by Chang Yang, Hongli Ma, Xianzhang Zeng, Jing Yang, Bang Xiao, Ruyi Tan, Yuanfei Liu and Qin Zeng in Clinical and Applied Thrombosis/Hemostasis

Footnotes

Acknowledgments

We express our sincere gratitude to the team members associated with the MIMIC database for their contributions to the gathering and structuring of publicly available data. The organization providing funding played no role in any part of this research, such as planning, data acquisition, analysis of results, interpretation of findings, or preparation of the manuscript.

We would like to thank Editage for English language editing.

ORCID iD

Yuanfei Liu

Ethics Approval and Consent to Participate

This study utilized the MIMIC-IV database (version 2.2), which was approved by the Institutional Review Board (IRB) of the Massachusetts Institute of Technology (MIT) (Protocol #0403000206). All data were fully de-identified, and the requirement for individual patient consent was waived by the MIT IRB in accordance with the U.S. Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.

Consent for Publication

Not applicable. This study used fully anonymized data from the MIMIC-IV database, and no individual patient identifiers were included in the manuscript.

Author Contributions

QZ: Conceptualization, investigation, writing – original draft, writing – review & editing. YFL: Data curation, methodology, writing – review & editing. CY: Data extraction, resources, writing – review & editing. HLM: Data curation, project administration, writing – original draft, writing – review & editing. XZZ: Conceptualization, methodology, supervision, validation, writing – review & editing. BX: Investigation, writing – review & editing. JY: Methodology, writing – review & editing. RYT: Investigation, writing – review & editing.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Chongqing Science and Technology and Health Joint Medical Research Youth Project (grant number: 2022QNXM021).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets analyzed during this study are available in the MIMIC-IV repository after completing the required training course and data use agreement.

All data generated or analyzed during this study are included in this published article (and its Supplementary Information files).

Supplemental Material

Supplemental material for this article is available online.

References

Khorana

Francis

Culakova

Kuderer

Lyman

. Thromboembolism is a leading cause of death in cancer patients receiving outpatient chemotherapy. J Thromb Haemost. 2007;5(3):632‐634.

Carini

Angriman

Scales

, et al. Venous thromboembolism in critically ill adult patients with hematologic malignancy: a population-based cohort study. Intensive Care Med. 2024;50(2):222‐233.

Moheimani

Jackson

. Venous thromboembolism: classification, risk factors, diagnosis, and management. Int Sch Res Notices. 2011;2011(1):124610.

Azoulay

Schellongowski

Darmon

, et al. The intensive care medicine research agenda on critically ill oncology and hematology patients. Intensive Care Med. 2017;43(9):1366‐1382.

Raskob

Angchaisuksiri

Blanco

, et al., eds. Thrombosis: a major contributor to global disease burden. Seminars in thrombosis and hemostasis. Thieme Medical Publishers; 2014.

Hayssen

Sahoo

Nguyen

, et al. Ability of Caprini and Padua risk-assessment models to predict venous thromboembolism in a nationwide veterans affairs study. J Vasc Surg Venous Lymphat Disord. 2024;12(2):101693.

Lukaszuk

Dolna-Michno

Plens

Czyzewicz

Undas

. The comparison between Caprini and Padua VTE risk assessment models for hospitalised cancer patients undergoing chemotherapy at the tertiary oncology department in Poland: is pharmacological thromboprophylaxis overused? Contemp Oncol (Pozn). 2018;22(1):31‐36.

Mulder

Candeloro

Kamphuisen

, et al. The Khorana score for prediction of venous thromboembolism in cancer patients: a systematic review and meta-analysis. Haematologica. 2019;104(6):1277‐1287.

Bishop

Nasrabadi

. Pattern recognition and machine learning. Springer; 2006.

10.

Rajkomar

Dean

Kohane

. Machine learning in medicine. N Engl J Med. 2019;380(14):1347‐1358.

11.

Bzdok

Altman

Krzywinski

. Statistics versus machine learning. Nat Methods. 2018;15(4):233‐234.

12.

Rajkomar

Oren

Chen

, et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med. 2018;1(1):18.

13.

Franco-Moreno

Madronal-Cerezo

Munoz-Rivas

Torres-Macho

Ruiz-Giardin

Ancos-Aracil

. Prediction of venous thromboembolism in patients with cancer using machine learning approaches: a systematic review and meta-analysis. JCO Clin Cancer Inform. 2023;7:e2300060.

14.

Mrad

Al Qurashi

Shah Mardan

QNM

, et al. Venous thromboembolism risk assessment models in plastic surgery: a systematic review and meta-analysis. Plast Reconstr Surg Glob Open. 2022;10(12):e4683.

15.

Johnson

AEW

Bulgarelli

Shen

, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023;10(1):1.

16.

Lee

Kim

Lee

, et al. Deep learning-based dynamic risk prediction of venous thromboembolism for patients with ovarian cancer in real-world settings from electronic health records. JCO Clin Cancer Inform. 2024;8:e2300192.

17.

Bedogni

. Clinical prediction models—a practical approach to development, validation and updating. Oxford University Press; 2009.

18.

Steyerberg

Vergouwe

. Towards better clinical prediction models: seven steps for development and an ABCD for validation. Eur Heart J. 2014;35(29):1925‐1931.

19.

Bone

Balk

Cerra

, et al. Definitions for sepsis and organ failure and guidelines for the use of innovative therapies in sepsis. The ACCP/SCCM Consensus Conference Committee. American College of Chest Physicians/Society of Critical Care Medicine. Chest. 1992;101(6):1644‐1655.

20. Vincent, Moreno, Takala, et al. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. On behalf of the Working Group on Sepsis-Related Problems of the European Society of Intensive Care Medicine. Intensive Care Med. 1996;22(7):707‐710.

21. Teasdale, Jennett. Assessment of coma and impaired consciousness. A practical scale. Lancet. 1974;2(7872):81‐84.

22. Baddam, Burns. Systemic inflammatory response syndrome. In: StatPearls [Internet]. StatPearls Publishing; 2025:920-935.

23. Knaus, Draper, Wagner, Zimmerman. APACHE II: a severity of disease classification system. Crit Care Med. 1985;13(10):818‐829.

24. Jiménez, Aujesky, Moores, et al. Simplification of the pulmonary embolism severity index for prognostication in patients with acute symptomatic pulmonary embolism. Arch Intern Med. 2010;170(15):1383‐1389.

25. Yuksel, Caliskan, et al. Are prehospital shock, modified shock, age-adjusted shock indices and some scoring systems effective in predicting the prognosis of high-energy trauma patients? Int Emerg Nurs. 2025;83:101678.

26. Fernández, García, Galar, Prati, Krawczyk, Herrera. Learning from imbalanced data sets. Springer; 2018.

27. Aubaidan, Kadir, Lajb, et al. A review of intelligent data analysis: machine learning approaches for addressing class imbalance in healthcare-challenges and perspectives. Intell Data Anal. 2025;29(3):699‐719.

28. Lundberg, Erion, Chen, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2(1):56‐67.

29. Ke, Meng, Finley, et al. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:30-52.

30. Chen, Guestrin. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016.

31. Breiman. Random forests. Mach Learn. 2001;45(1):5‐32.

32. Tibshirani, Efron. An introduction to the bootstrap. Monogr Statist Appl Probab. 1993;57(1):1‐436.

33. Chicco, Jurman. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6.

34. Ribeiro, Singh, Guestrin. “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016.

35. Geller. Food and Drug Administration published final guidance on clinical decision support software. J Clin Eng. 2023;48(1):3‐7.

36. Austin. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28(25):3083‐3107.

37. Christodoulou, Collins, Steyerberg, Verbakel, Van Calster. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12‐22.

38. Van Calster, Vickers. Calibration of risk prediction models: impact on decision-analytic performance. Med Decis Making. 2015;35(2):162‐169.

39. Van Calster, McLernon, Van Smeden, et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230.

40. Vickers, Van Calster, Steyerberg. A simple, step-by-step guide to interpreting decision curve analysis. Diagn Progn Res. 2019;3(1):18.

41. Van Calster, Wynants, Verbeek, et al. Reporting and interpreting decision curve analysis: a guide for investigators. Eur Urol. 2018;74(6):796‐804.

42. Afifi, Leverich, Tadrousse, Ren, Nazzal. Racial, biological sex, and geographic disparities of venous thromboembolism in the United States, 2016 to 2019. J Vasc Surg. 2025;81(2):A15.

43. Athale, Siciliano, Thabane, et al. Epidemiology and clinical risk factors predisposing to thromboembolism in children with cancer. Pediatr Blood Cancer. 2008;51(6):792‐797.

44. Sheikhalishahi, Bhattacharyya, Celi, Osmani. An interpretable deep learning model for time-series electronic health records: case study of delirium prediction in critical care. Artif Intell Med. 2023;144:102659.

45. Yadaw, Li Y-c, Bose, Iyengar, Bunyavanich, Pandey. Clinical features of COVID-19 mortality: development and validation of a clinical prediction model. Lancet Digit Health. 2020;2(10):e516‐e525.

46. Han, Zhou, Wang, et al. Linear inverse association between prognostic nutritional index and colorectal cancer risk based on NHANES data. Sci Rep. 2025;15(1):25647.

47. Hao, Wang, Zhu, Herasevich. Applying machine learning for perioperative adverse event prediction: a narrative review toward better clinical efficacy and usability. Anesthesiol Perioper Sci. 2025;3(4):51.

48. Prokhorenkova, Gusev, Vorobev, Dorogush, Gulin. CatBoost: unbiased boosting with categorical features. Adv Neural Inf Process Syst. 2018;31:6639-6649.

49. Babyak. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Biopsychosoc Sci Med. 2004;66(3):411‐421.

50. Chen, Williamson, Mahmood. Synthetic data in machine learning for medicine and healthcare. Nat Biomed Eng. 2021;5(6):493‐497.

51. Xiao, Choi, Sun. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J Am Med Inform Assoc. 2018;25(10):1419‐1428.

52. Collins, Reitsma, Altman, Moons. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Br J Surg. 2015;102(3):148‐158.

Supplementary Material

Please find the following supplemental material available below.

