Machine learning-based analysis of risk factors for chronic total occlusion in an Asian population

Abstract

Objectives

Chronic total occlusion (CTO) is a form of coronary artery disease (CAD) requiring percutaneous coronary intervention. There has been minimal research regarding CTO-specific risk factors and predictive models. We developed machine learning predictive models based on clinical characteristics to identify patients with CTO before coronary angiography.

Methods

Data from 1473 patients with CAD, including 317 patients with and 1156 patients without CTO, were retrospectively analyzed. Partial least squares discriminant analysis (PLS-DA), random forest (RF), and support vector machine (SVM) models were used to identify CTO-specific risk factors and predict CTO development. Receiver operating characteristic (ROC) curve analysis was performed for model validation.

Results

For CTO prediction, the PLS-DA model included 10 variables; the ROC value was 0.706. The RF model included 42 variables; the ROC value was 0.702. The SVM model included 20 variables; the ROC value was 0.696. DeLong’s test showed no difference among the three models. Four variables were present in all models: sex, neutrophil percentage, creatinine, and brain natriuretic peptide (BNP).

Conclusions

Validation of machine learning prediction models for CTO revealed that the PLS-DA model had the best prediction performance. Sex, neutrophil percentage, creatinine, and BNP may be important risk factors for CTO development.

Keywords

Chronic total occlusion, coronary artery disease, machine learning, risk factor, prediction, coronary angiography, echocardiography, brain natriuretic peptide, creatinine

Introduction

Chronic total occlusion (CTO) occurs when a coronary artery is completely blocked, resulting in complete cessation of blood flow to the affected area.¹ The condition is characterized by Thrombolysis in Myocardial Infarction (TIMI) flow grade 0, with a minimum duration of 3 months. It has been estimated that CTO affects 16% to 20% of patients who undergo coronary angiography (CAG) for the diagnosis of coronary artery disease (CAD).² Despite its high prevalence, CTO recanalization is limited because of low success rates, high complication rates, long procedure times, and substantial costs.³ Considering the challenges associated with CTO recanalization, early detection of risk factors can facilitate earlier intervention. This detection might be achieved by identifying easily measurable predictors that can be used to recognize patients at risk of CTO. This information could help healthcare providers to prevent or mitigate the impact of CTO, thereby improving patient outcomes.

In recent years, some studies have analyzed risk factors associated with CAD incidence, such as demographic information, biochemistry, and echocardiography. However, none of these studies have specifically focused on clinical characteristics associated with CTO; such characteristics could generate more accurate and personalized risk assessments.⁴ Given the technical difficulties associated with CTO recanalization, as well as its high cost and limited success, there is a clear need for more effective risk stratification and prediction tools.⁵ Through analyses of CTO data embedded in clinical characteristics, along with the use of machine learning algorithms, early and accurate identification of CTO may lead to more effective treatment and better patient outcomes.⁶

Here, we conducted a risk factor analysis and constructed three models to predict CTO risk based on clinical and demographic characteristics, as well as biochemical parameters and echocardiography findings. Computational analysis revealed that these models can provide clinically relevant predictions of CTO risk, highlighting the emerging opportunity to meet an important need in this field.

Methods

Study cohort

This retrospective study included consecutive patients with unstable angina pectoris who underwent CAG at Beijing Anzhen Hospital, Capital Medical University, between January 2019 and December 2020. Exclusion criteria were previous myocardial infarction, previous percutaneous coronary intervention, and/or previous coronary artery bypass grafting. Patients were divided into two groups according to procedural outcome: CTO and non-CTO. CTO was defined as a coronary lesion with Thrombolysis in Myocardial Infarction (TIMI) flow grade 0 and a minimum duration of 3 months⁷; non-CTO was defined as stenosis involving ≥50% of the luminal diameter in one branch of the main coronary artery.

The study protocol was approved by the Human Research Ethics Committee of Beijing Anzhen Hospital, Capital Medical University (approval no. 2022177X); it adhered to the principles of the Declaration of Helsinki (as revised in 2013). Considering the retrospective study design, the Human Research Ethics Committee waived the requirement for informed consent. All patient information was anonymized prior to analysis. This article adhered to the TRIPOD reporting checklist.⁸

Data collection

Data were extracted independently by YS and WJ; any disagreements were resolved through discussion with a third investigator (JL). Baseline clinical and demographic characteristics, biochemical parameters, and echocardiography findings were recorded for each patient. Current tobacco use was defined as smoking >1 cigarette per day for at least 6 months. Fasting blood samples were collected for laboratory tests to measure a wide range of blood parameters (listed in Table 1). The atherogenic index of plasma (AIP) was calculated as log10 (triglycerides/high-density lipoprotein cholesterol [HDL-C]). The triglyceride-glucose (TyG) index was calculated as ln [fasting triglycerides (mg/dL) × fasting glucose (mg/dL)/2]. The MONO/LYM ratio was calculated as monocytes/lymphocytes. The left ventricular end diastolic diameter (LVEDD) and ejection fraction (EF) were determined by echocardiography. Patients with more than 30% missing data were excluded from the analysis. Remaining missing data were imputed using the missForest package in R version 4.2.0 (R Foundation for Statistical Computing, Vienna, Austria), which performs nonparametric missing value imputation using a random forest approach.
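The derived indices defined above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and the mmol/L to mg/dL conversion factors (88.57 for triglycerides, 18.016 for glucose) are standard values assumed here because Table 1 reports these analytes in mmol/L, whereas the TyG formula is stated in mg/dL.

```python
import math

# Assumed standard conversion factors (not stated in the article).
TG_MMOL_TO_MGDL = 88.57    # triglycerides
GLU_MMOL_TO_MGDL = 18.016  # glucose


def atherogenic_index(tg_mmol_l, hdl_c_mmol_l):
    """AIP = log10(triglycerides / HDL-C), both in mmol/L."""
    return math.log10(tg_mmol_l / hdl_c_mmol_l)


def tyg_index(tg_mmol_l, glu_mmol_l):
    """TyG = ln(triglycerides [mg/dL] x glucose [mg/dL] / 2)."""
    tg_mgdl = tg_mmol_l * TG_MMOL_TO_MGDL
    glu_mgdl = glu_mmol_l * GLU_MMOL_TO_MGDL
    return math.log(tg_mgdl * glu_mgdl / 2)


def mono_lym_ratio(monocytes, lymphocytes):
    """MONO/LYM ratio; both counts must use the same unit."""
    return monocytes / lymphocytes


# Hypothetical patient with values close to the cohort means in Table 1.
aip = atherogenic_index(1.70, 1.12)
tyg = tyg_index(1.70, 6.56)
```

Because the TyG index is a logarithm of a product, the choice of conversion factor shifts every patient's value by the same constant, so small differences in the assumed factors would not change patient ranking.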

Table 1.

Baseline clinical characteristics of patients in this study*

Variables	Derivation cohort, all (n = 982)	Derivation cohort, non-CTO (n = 782)	Derivation cohort, CTO (n = 200)	Validation cohort, all (n = 491)	Validation cohort, non-CTO (n = 374)	Validation cohort, CTO (n = 117)
Sex, n (%)
Female	282 (28.72)	253 (32.35)	29 (14.5)	142 (28.92)	120 (32.09)	22 (18.8)
Male	700 (71.28)	529 (67.65)	171 (85.5)	349 (71.08)	254 (67.91)	95 (81.2)
Smoking, n (%)
No	510 (51.93)	433 (55.37)	77 (38.5)	266 (54.18)	207 (55.35)	59 (50.43)
Yes	472 (48.07)	349 (44.63)	123 (61.5)	225 (45.82)	167 (44.65)	58 (49.57)
Drinking, n (%)
No	650 (66.19)	526 (67.26)	124 (62)	344 (70.06)	260 (69.52)	84 (71.79)
Yes	332 (33.81)	256 (32.74)	76 (38)	147 (29.94)	114 (30.48)	33 (28.21)
Hypertension, n (%)
No	351 (35.74)	282 (36.06)	69 (34.5)	182 (37.07)	135 (36.1)	47 (40.17)
Yes	631 (64.26)	500 (63.94)	131 (65.5)	309 (62.93)	239 (63.9)	70 (59.83)
Diabetes mellitus, n (%)
No	676 (68.84)	545 (69.69)	131 (65.5)	322 (65.58)	250 (66.84)	72 (61.54)
Yes	306 (31.16)	237 (30.31)	69 (34.5)	169 (34.42)	124 (33.16)	45 (38.46)
Stroke, n (%)
No	892 (90.84)	712 (91.05)	180 (90)	448 (91.24)	340 (90.91)	108 (92.31)
Yes	90 (9.16)	70 (8.95)	20 (10)	43 (8.76)	34 (9.09)	9 (7.69)
Dyslipidemia, n (%)
No	206 (20.98)	165 (21.1)	41 (20.5)	104 (21.18)	72 (19.25)	32 (27.35)
Yes	776 (79.02)	617 (78.9)	159 (79.5)	387 (78.82)	302 (80.75)	85 (72.65)
Age (years)	60.26 ± 9.40	60.42 ± 9.13	59.62 ± 10.40	60.35 ± 9.31	60.41 ± 9.18	60.16 ± 9.73
BMI (kg/m²)	25.82 ± 3.25	25.77 ± 3.23	25.99 ± 3.31	25.98 ± 3.20	25.88 ± 3.18	26.28 ± 3.27
SP (mmHg)	131.65 ± 16.68	132.10 ± 16.50	129.88 ± 17.30	131.37 ± 17.00	132.35 ± 17.15	128.26 ± 16.23
DP (mmHg)	78.37 ± 10.95	78.36 ± 10.62	78.38 ± 12.16	77.18 ± 10.66	77.68 ± 11.08	75.56 ± 9.05
WBC (10⁹/L)	7.16 ± 2.02	7.03 ± 1.98	7.63 ± 2.13	7.04 ± 1.99	6.86 ± 1.80	7.64 ± 2.43
RBC (10¹²/L)	4.57 ± 0.50	4.58 ± 0.48	4.55 ± 0.56	4.50 ± 0.54	4.51 ± 0.53	4.46 ± 0.57
PLT (10⁹/L)	221.74 ± 55.28	222.65 ± 53.94	218.21 ± 60.21	216.93 ± 60.97	219.98 ± 61.96	207.20 ± 56.83
Hb (g/L)	140.22 ± 15.69	140.34 ± 15.24	139.77 ± 17.37	138.31 ± 16.40	138.85 ± 16.41	136.61 ± 16.33
Hct (%)	40.61 ± 4.21	40.66 ± 4.05	40.41 ± 4.76	40.06 ± 4.52	40.24 ± 4.47	39.50 ± 4.62
LYM%	0.28 ± 0.08	0.28 ± 0.08	0.25 ± 0.08	0.27 ± 0.08	0.28 ± 0.08	0.25 ± 0.08
MONO%	5.45 ± 1.57	5.46 ± 1.56	5.42 ± 1.64	5.45 ± 1.64	5.52 ± 1.68	5.23 ± 1.50
NE%	64.37 ± 8.79	63.78 ± 8.62	66.71 ± 9.06	64.77 ± 9.00	64.00 ± 8.79	67.20 ± 9.27
MONO (10⁹/L)	0.39 ± 0.14	0.38 ± 0.14	0.41 ± 0.15	0.38 ± 0.13	0.37 ± 0.13	0.39 ± 0.14
NE (10⁹/L)	4.67 ± 1.72	4.54 ± 1.63	5.18 ± 1.97	4.64 ± 1.79	4.46 ± 1.59	5.23 ± 2.23
MCV (fL)	88.94 ± 4.22	88.95 ± 4.22	88.92 ± 4.23	89.22 ± 4.28	89.36 ± 4.41	88.80 ± 3.81
MCH (pg)	30.69 ± 1.69	30.68 ± 1.70	30.74 ± 1.63	30.79 ± 1.64	30.81 ± 1.67	30.72 ± 1.54
MCHC (g/L)	345.09 ± 10.77	344.93 ± 10.90	345.73 ± 10.26	345.12 ± 10.11	344.90 ± 9.94	345.85 ± 10.66
RDW-SD (fL)	42.19 ± 2.95	42.18 ± 2.97	42.21 ± 2.88	42.11 ± 2.91	42.19 ± 2.79	41.87 ± 3.27
RDW-CV (%)	13.06 ± 0.83	13.05 ± 0.83	13.09 ± 0.82	13.00 ± 0.67	13.00 ± 0.63	12.98 ± 0.76
EOS (10⁹/L)	0.11 (0.06, 0.18)	0.11 (0.06, 0.18)	0.12 (0.06, 0.19)	0.11 (0.06, 0.18)	0.10 (0.06, 0.17)	0.13 (0.07, 0.22)
EOS%	1.60 (0.90, 2.60)	1.60 (0.90, 2.60)	1.60 (0.90, 2.70)	1.60 (0.90, 2.70)	1.55 (0.90, 2.60)	1.70 (0.90, 3.00)
BAS (10⁹/L)	0.02 (0.02, 0.04)	0.02 (0.02, 0.04)	0.02 (0.01, 0.04)	0.02 (0.01, 0.04)	0.02 (0.01, 0.04)	0.02 (0.01, 0.04)
BAS%	0.40 (0.20, 0.60)	0.40 (0.20, 0.60)	0.30 (0.20, 0.60)	0.30 (0.20, 0.60)	0.30 (0.20, 0.60)	0.30 (0.20, 0.50)
MPV (fL)	10.42 ± 0.96	10.43 ± 0.96	10.40 ± 0.94	10.39 ± 0.89	10.38 ± 0.84	10.41 ± 1.02
PCT (%)	0.23 ± 0.06	0.23 ± 0.05	0.23 ± 0.06	0.22 ± 0.06	0.23 ± 0.06	0.21 ± 0.06
PDW (%)	12.65 ± 2.13	12.62 ± 2.11	12.80 ± 2.22	12.59 ± 2.08	12.53 ± 2.00	12.78 ± 2.34
P-LCR (%)	28.51 ± 7.73	28.53 ± 7.77	28.46 ± 7.56	28.15 ± 7.17	28.10 ± 6.92	28.33 ± 7.94
AST (U/L)	18.00 (15.00, 22.00)	18.00 (15.00, 22.00)	19.00 (15.00, 22.25)	18.00 (15.00, 23.00)	18.00 (15.00, 23.00)	18.00 (15.00, 23.00)
ALT (U/L)	23.58 ± 15.15	23.33 ± 15.18	24.53 ± 15.02	23.27 ± 12.71	23.09 ± 12.94	23.85 ± 11.98
TG (mmol/L)	1.70 ± 0.98	1.66 ± 0.94	1.84 ± 1.12	1.75 ± 1.06	1.74 ± 1.05	1.80 ± 1.08
TC (mmol/L)	4.14 ± 1.01	4.11 ± 0.98	4.27 ± 1.11	4.17 ± 1.00	4.16 ± 0.96	4.19 ± 1.13
HDL-C (mmol/L)	1.12 ± 0.26	1.13 ± 0.27	1.06 ± 0.25	1.12 ± 0.28	1.13 ± 0.28	1.07 ± 0.25
LDL (mmol/L)	2.43 ± 0.88	2.39 ± 0.84	2.56 ± 1.00	2.42 ± 0.85	2.40 ± 0.82	2.49 ± 0.95
FFA (mmol/L)	0.47 ± 0.25	0.47 ± 0.25	0.48 ± 0.24	0.50 ± 0.26	0.51 ± 0.26	0.48 ± 0.26
Non-HDL-C (mmol/L)	3.02 ± 0.97	2.97 ± 0.94	3.19 ± 1.09	3.05 ± 0.97	3.03 ± 0.93	3.12 ± 1.09
sdLDL (mmol/L)	0.70 ± 0.36	0.69 ± 0.36	0.72 ± 0.38	0.71 ± 0.37	0.71 ± 0.37	0.72 ± 0.39
C1Q (μg/mL)	169.15 ± 30.93	168.86 ± 30.21	170.32 ± 33.62	169.00 ± 31.31	170.28 ± 31.41	164.89 ± 30.78
GLU (mmol/L)	6.56 ± 2.59	6.50 ± 2.59	6.77 ± 2.58	6.57 ± 2.40	6.53 ± 2.33	6.68 ± 2.63
hs-CRP (mg/L)	2.00 ± 2.50	1.93 ± 2.45	2.27 ± 2.66	2.06 ± 2.67	2.02 ± 2.66	2.16 ± 2.71
UR (mmol/L)	5.55 ± 1.84	5.48 ± 1.81	5.86 ± 1.95	5.57 ± 1.68	5.52 ± 1.66	5.75 ± 1.74
CR (μmol/L)	74.58 ± 18.75	73.22 ± 17.76	79.88 ± 21.43	73.99 ± 17.67	72.71 ± 16.01	78.10 ± 21.75
UA (μmol/L)	340.33 ± 84.27	336.13 ± 81.62	356.74 ± 92.31	342.12 ± 89.44	339.11 ± 90.25	351.74 ± 86.47
HbA1c (%)	6.42 ± 1.22	6.41 ± 1.23	6.50 ± 1.17	6.46 ± 1.21	6.44 ± 1.18	6.52 ± 1.31
Glycated albumin %	15.17 ± 3.76	15.13 ± 3.79	15.32 ± 3.65	15.32 ± 3.66	15.21 ± 3.62	15.69 ± 3.78
Glycated albumin (mmol/L)	0.56 ± 0.17	0.56 ± 0.18	0.56 ± 0.17	0.57 ± 0.17	0.56 ± 0.17	0.58 ± 0.17
Hcy (μmol/L)	16.08 ± 9.12	15.75 ± 9.05	17.34 ± 9.29	15.75 ± 7.98	15.48 ± 8.04	16.62 ± 7.74
CK (U/L)	77.00 (59.00, 106.00)	77.00 (59.00, 107.80)	76.00 (60.00, 101.00)	80.00 (58.00, 108.00)	80.00 (58.00, 107.80)	82.00 (58.00, 108.00)
CK-MB (ng/mL)	1.80 (1.20, 2.50)	1.80 (1.20, 2.50)	1.90 (1.30, 2.60)	1.70 (1.20, 2.40)	1.70 (1.20, 2.30)	1.90 (1.20, 2.60)
LDH (U/L)	179.94 ± 52.80	180.39 ± 55.29	178.20 ± 41.70	176.48 ± 43.20	177.80 ± 44.75	172.26 ± 37.69
K (mmol/L)	4.10 ± 0.38	4.09 ± 0.37	4.15 ± 0.41	4.10 ± 0.37	4.09 ± 0.37	4.13 ± 0.35
Cl (mmol/L)	103.16 ± 2.64	103.20 ± 2.61	103.03 ± 2.77	103.25 ± 2.69	103.23 ± 2.76	103.34 ± 2.45
Ca (mmol/L)	2.30 ± 0.10	2.30 ± 0.10	2.29 ± 0.10	2.30 ± 0.11	2.30 ± 0.11	2.30 ± 0.09
P (mmol/L)	1.13 ± 0.16	1.13 ± 0.16	1.13 ± 0.16	1.13 ± 0.17	1.12 ± 0.17	1.14 ± 0.17
Mg (mmol/L)	0.90 ± 0.07	0.89 ± 0.07	0.90 ± 0.07	0.90 ± 0.07	0.89 ± 0.07	0.91 ± 0.06
TP (g/L)	70.87 ± 5.14	70.96 ± 5.09	70.53 ± 5.33	70.94 ± 5.37	71.01 ± 5.50	70.71 ± 4.96
ALB (g/L)	43.83 ± 3.30	43.89 ± 3.25	43.60 ± 3.51	43.99 ± 3.30	44.03 ± 3.34	43.89 ± 3.15
TBIL (μmol/L)	12.16 ± 5.77	12.13 ± 5.64	12.28 ± 6.29	12.10 ± 5.28	12.25 ± 5.46	11.63 ± 4.63
DBIL (μmol/L)	5.01 ± 2.07	5.01 ± 2.01	5.04 ± 2.26	5.10 ± 2.01	5.15 ± 2.08	4.93 ± 1.79
ALP (U/L)	73.11 ± 19.73	72.57 ± 19.67	75.19 ± 19.87	72.11 ± 22.20	72.74 ± 21.83	70.12 ± 23.31
GGT (U/L)	32.10 ± 26.74	31.91 ± 27.23	32.88 ± 24.80	33.65 ± 33.67	34.60 ± 36.91	30.62 ± 19.92
TBA (μmol/L)	4.40 ± 3.77	4.38 ± 3.78	4.50 ± 3.76	4.34 ± 3.55	4.19 ± 3.33	4.81 ± 4.15
CHE (kU/L)	8.59 ± 1.46	8.58 ± 1.45	8.63 ± 1.50	8.64 ± 1.56	8.67 ± 1.63	8.56 ± 1.33
eGFR	91.40 ± 15.27	91.86 ± 14.65	89.62 ± 17.42	91.70 ± 14.26	92.10 ± 14.24	90.42 ± 14.32
BNP (pg/mL)	34.00 (19.00, 74.00)	31.00 (18.00, 66.00)	56.00 (28.00, 178.80)	36.00 (20.00, 78.00)	34.00 (18.00, 72.00)	47.00 (25.00, 143.00)
PT (s)	11.52 ± 0.94	11.48 ± 0.87	11.66 ± 1.17	11.46 ± 0.87	11.42 ± 0.85	11.60 ± 0.91
PT%	99.86 ± 11.97	100.26 ± 11.69	98.30 ± 12.93	100.57 ± 11.97	101.20 ± 11.99	98.56 ± 11.72
APTT (s)	30.47 ± 3.47	30.42 ± 3.54	30.67 ± 3.16	30.21 ± 3.10	30.30 ± 3.23	29.95 ± 2.66
INR	1.00 (0.96, 1.04)	1.00 (0.96, 1.04)	1.00 (0.97, 1.05)	1.00 (0.96, 1.04)	1.00 (0.95, 1.04)	1.01 (0.97, 1.05)
FDP (μg/mL)	0.20 (0.00, 0.60)	0.20 (0.00, 0.50)	0.40 (0.00, 0.80)	0.30 (0.00, 0.60)	0.30 (0.00, 0.60)	0.20 (0.00, 0.60)
FBG (g/L)	3.16 ± 0.59	3.15 ± 0.59	3.23 ± 0.59	3.12 ± 0.60	3.10 ± 0.60	3.18 ± 0.58
DD (ng/mL)	88.50 (55.25, 135.80)	83.00 (54.00, 127.00)	106.00 (59.00, 172.20)	88.00 (57.00, 138.50)	87.00 (54.25, 135.80)	96.00 (63.00, 143.00)
AIP	0.13 (-0.05, 0.31)	0.12 (-0.06, 0.30)	0.18 (0.00, 0.34)	0.14 (-0.03, 0.31)	0.12 (-0.05, 0.29)	0.20 (0.03, 0.32)
TyG	8.91 ± 0.61	8.88 ± 0.61	9.01 ± 0.61	8.94 ± 0.61	8.93 ± 0.62	8.99 ± 0.59
MONO/LYM	0.19 (0.15, 0.26)	0.19 (0.15, 0.25)	0.21 (0.16, 0.28)	0.20 (0.15, 0.25)	0.20 (0.15, 0.25)	0.21 (0.16, 0.29)
LVEDD (mm)	30.74 ± 5.12	30.23 ± 4.45	32.76 ± 6.83	31.25 ± 5.67	30.71 ± 5.22	32.99 ± 6.64
EF (%)	62.67 ± 7.01	63.44 ± 6.02	59.66 ± 9.39	62.61 ± 7.38	63.29 ± 6.68	60.44 ± 8.95

Data are shown as mean ± standard deviation or median (interquartile range), unless otherwise specified.

CTO, chronic total occlusion; BMI, body mass index; SP, systolic blood pressure; DP, diastolic blood pressure; WBC, white blood cells; RBC, red blood cells; PLT, platelets; Hb, hemoglobin; Hct, hematocrit; LYM%, lymphocyte percentage; MONO%, monocyte percentage; NE%, neutrophil percentage; MONO, monocytes; NE, neutrophils; MCV, mean corpuscular volume; MCH, mean corpuscular hemoglobin; MCHC, mean corpuscular hemoglobin concentration; RDW-SD, red cell distribution width-standard deviation; RDW-CV, red cell distribution width-coefficient of variation; EOS, eosinophils; EOS%, eosinophil percentage; BAS, basophils; BAS%, basophil percentage; MPV, mean platelet volume; PCT, plateletcrit; PDW, platelet distribution width; P-LCR, platelet-large cell ratio; AST, aspartate transaminase; ALT, alanine transaminase; TG, triglycerides; TC, total cholesterol; HDL-C, high-density lipoprotein cholesterol; LDL, low-density lipoprotein cholesterol; FFA, free fatty acids; sdLDL, small dense low-density lipoprotein; C1Q, complement C1q; GLU, glucose; hs-CRP, high-sensitivity C-reactive protein; UR, urea; CR, creatinine; UA, uric acid; HbA1c, glycosylated hemoglobin type A1C; Hcy, homocysteine; CK, creatine kinase; CK-MB, creatine kinase-MB; LDH, lactate dehydrogenase; K, potassium; Cl, chloride; Ca, calcium; P, phosphorus; Mg, magnesium; TP, total protein; ALB, albumin; TBIL, total bilirubin; DBIL, direct bilirubin; ALP, alkaline phosphatase; GGT, gamma-glutamyl transferase; TBA, total bile acid; CHE, cholinesterase; eGFR, estimated glomerular filtration rate; BNP, brain natriuretic peptide; PT, prothrombin time; PT%, prothrombin time percentage; APTT, activated partial thromboplastin time; INR, international normalized ratio; FDP, fibrin/fibrinogen degradation products; FBG, fibrinogen; DD, D-dimer; AIP, atherogenic index of plasma; TyG, triglyceride-glucose; LVEDD, left ventricular end diastolic diameter; EF, ejection fraction.

Machine learning algorithms

Data were pre-processed by variable normalization, transformed using log10 and autoscaling functions, and then subjected to analysis by machine learning algorithms (i.e., models). The following three models were used to identify CTO risk factors and predict CTO presence: partial least squares discriminant analysis (PLS-DA), random forest (RF), and support vector machine (SVM).
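The pre-processing step can be illustrated with a short sketch, assuming that "autoscaling" means centering each variable to mean 0 and scaling it to unit standard deviation after the log10 transformation. The function name and the example values are hypothetical; the article does not state how zero values (which cannot be log-transformed) were handled, so this sketch simply requires positive inputs.

```python
import math
import statistics


def log10_autoscale(values):
    """Log10-transform positive values, then autoscale to mean 0 and SD 1."""
    logged = [math.log10(v) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation
    return [(v - mean) / sd for v in logged]


# Hypothetical right-skewed BNP values (pg/mL).
bnp = [19.0, 31.0, 34.0, 74.0, 178.8]
scaled = log10_autoscale(bnp)
```

After this step every variable is on a comparable scale, which matters for distance- and covariance-based methods such as SVM and PLS-DA; tree-based methods such as RF are largely insensitive to it.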

The performance of each model was evaluated by Monte-Carlo cross-validation with balanced sub-sampling to generate receiver operating characteristic (ROC) curves. In each run of Monte-Carlo cross-validation, the models were trained using data for two-thirds of the patients (i.e., the derivation cohort), then validated using data for the remaining one-third of the patients (i.e., the validation cohort); models containing the top 3, 5, 10, 20, half of the maximum number, and the maximum number of most important characteristics were evaluated. This process was repeated multiple times to determine the performance (with 95% confidence interval [CI]) of each model. Variable importance in each model was assessed using the Gini criterion. Results were visualized using Venn diagrams and UpSet plots generated by the Venn and UpSetR packages, respectively, in R version 4.2.0.
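The resampling scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: the "model" is a stand-in that scores patients by a single predictor, the data are synthetic, the number of runs (200) is arbitrary, and the 95% CI is taken as a percentile interval over runs, which the article does not specify. All function and variable names are hypothetical.

```python
import random
import statistics


def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


def mccv_auc(x, y, n_runs=200, seed=0):
    """Monte-Carlo cross-validation with a 2/3 : 1/3 split per run.

    x: one numeric predictor per patient; y: 1 = CTO, 0 = non-CTO.
    Stand-in "model": the predictor itself, oriented so that the class
    with the higher training mean receives the higher score.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    aucs = []
    for _ in range(n_runs):
        rng.shuffle(pos)
        rng.shuffle(neg)
        cut_p, cut_n = 2 * len(pos) // 3, 2 * len(neg) // 3
        train_p, test_p = pos[:cut_p], pos[cut_p:]
        train_n, test_n = neg[:cut_n], neg[cut_n:]
        # Balanced sub-sampling: equal class sizes in the training sample.
        k = min(len(train_p), len(train_n))
        mean_p = statistics.mean([x[i] for i in train_p[:k]])
        mean_n = statistics.mean([x[i] for i in train_n[:k]])
        sign = 1.0 if mean_p >= mean_n else -1.0
        aucs.append(auc([sign * x[i] for i in test_p],
                        [sign * x[i] for i in test_n]))
    aucs.sort()
    lower = aucs[int(0.025 * n_runs)]
    upper = aucs[int(0.975 * n_runs) - 1]
    return statistics.mean(aucs), (lower, upper)


# Synthetic data only: 60 "CTO" and 200 "non-CTO" values.
data_rng = random.Random(42)
x = ([data_rng.gauss(1.0, 1.0) for _ in range(60)]
     + [data_rng.gauss(0.0, 1.0) for _ in range(200)])
y = [1] * 60 + [0] * 200
mean_auc, (ci_low, ci_high) = mccv_auc(x, y)
```

Balancing the training sample prevents the majority (non-CTO) class from dominating model fitting, while the held-out third keeps its natural class ratio so that the AUC reflects the cohort as observed.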

Statistical analysis

Continuous variables were presented as means ± standard deviations or medians (interquartile ranges), as indicated in Table 1; differences between groups were analyzed by Student's t-test. Categorical variables were presented as frequencies and percentages; differences between groups were evaluated by the chi-square test or Fisher's exact test. DeLong’s test was used to compare prediction model performances. Statistical analysis was performed using SPSS version 23.0 software (IBM Corp., Armonk, NY, USA).

Results

Patient characteristics

This study included 1473 patients (1049 [71.22%] men). Of these patients, 317 (21.5%) were diagnosed with CTO. Table 1 summarizes the patients’ clinical characteristics. For each patient, 83 variables were collected, including demographic characteristics, clinical findings, laboratory test results, and echocardiography results. Compared with the non-CTO group, the CTO group had higher percentages of men and current smokers. Additionally, the CTO group had comparatively higher values for white blood cells, neutrophil percentage (NE%), monocytes, eosinophils, aspartate transaminase, triglycerides, low-density lipoprotein cholesterol, non-HDL-C, urea, creatinine, uric acid, homocysteine, creatine kinase-MB, potassium, brain natriuretic peptide (BNP), prothrombin time, international normalized ratio, fibrin/fibrinogen degradation products, fibrinogen, D-dimer, atherogenic index of plasma, TyG, MONO/LYM, and LVEDD. Conversely, the CTO group had comparatively lower values for systolic blood pressure, platelets, lymphocyte percentage, plateletcrit, HDL-C, estimated glomerular filtration rate, and EF.

Variable selection

For identification of the most important characteristics, three machine learning models were used to select variables. The results are presented in Figure 1, which shows the relationship between ROC value and variable number. The PLS-DA model had the highest ROC value with 10 variables (Figure 1a), the RF model had the highest ROC value with 42 variables (Figure 1b), and the SVM model had the highest ROC value with 20 variables (Figure 1c). As the number of variables increased to 83, the ROC values remained stable.

Figure 1.

Relationships between receiver operating characteristic curve value and number of variables in (a) partial least squares discriminant analysis (PLS-DA), (b) random forest (RF), and (c) support vector machine (SVM) models for chronic total occlusion (CTO) prediction. AUC, area under the curve; CI, confidence interval.

Key variables

Key predictors in the three models were identified according to Gini impurity (Figure 2). In the PLS-DA model, the top five predictors were BNP (pg/mL), LVEDD (mm), EF (%), neutrophils (10⁹/L), and lymphocyte percentage with relative importance scores of 2.59, 2.31, 2.29, 2.19, and 2.07, respectively (Figure 2a). In the RF model, the top five predictors were BNP (pg/mL), LVEDD (mm), EF (%), neutrophils (10⁹/L), and sex with relative importance scores of 7.14, 6.90, 4.62, 4.46, and 4.25, respectively (Figure 2b). In the SVM model, the top five predictors were prothrombin time percentage, sex, TyG, plateletcrit (%), and BNP (pg/mL) with relative importance scores of 0.50, 0.44, 0.37, 0.35, and 0.29, respectively (Figure 2c).
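For readers unfamiliar with the Gini criterion, the following sketch shows how Gini impurity and the impurity decrease of a single split are computed. Gini importance is defined for decision-tree splits; the article does not describe how the criterion was applied to PLS-DA and SVM, so the sketch covers only the tree-based definition. The function names are hypothetical, and the counts are the sex-by-CTO counts from Table 1 pooled across the derivation and validation cohorts.

```python
def gini_impurity(n_cto, n_non_cto):
    """Gini impurity of a node: 1 - p^2 - (1 - p)^2."""
    p = n_cto / (n_cto + n_non_cto)
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def gini_decrease(parent, left, right):
    """Impurity decrease of a split; each argument is (n_cto, n_non_cto)."""
    n = sum(parent)
    weighted_children = (sum(left) / n * gini_impurity(*left)
                         + sum(right) / n * gini_impurity(*right))
    return gini_impurity(*parent) - weighted_children


# Pooled counts from Table 1: (CTO, non-CTO).
all_patients = (317, 1156)
men = (171 + 95, 529 + 254)    # 266 CTO, 783 non-CTO
women = (29 + 22, 253 + 120)   # 51 CTO, 373 non-CTO

root_impurity = gini_impurity(*all_patients)
decrease_by_sex = gini_decrease(all_patients, men, women)
```

In a random forest, decreases of this kind are accumulated over every split that uses a given variable and averaged across trees, which yields the variable's Gini importance.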

Figure 2.

Variable importances for chronic total occlusion (CTO) prediction in (a) partial least squares discriminant analysis (PLS-DA), (b) random forest (RF), and (c) support vector machine (SVM) models.

Risk factor analysis

The Venn diagram (Figure 3a) shows how the variables selected by the three models overlapped. Of the 83 candidate variables, 47 were selected by at least one model: 21 were unique to the RF model, five were unique to the SVM model, 11 were present only in the SVM and RF models, and six were present only in the PLS-DA and RF models. Importantly, four variables were present in all three models: sex, NE%, creatinine, and BNP (Figure 3b).
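The region counts reported for Figure 3a can be cross-checked against the model sizes given earlier (10, 42, and 20 variables for PLS-DA, RF, and SVM). The short sketch below does this arithmetic; the dictionary layout and function name are illustrative choices, and regions not mentioned in the text are assumed to be empty.

```python
# Region sizes reported for the Venn diagram (Figure 3a).
regions = {
    ("RF",): 21,
    ("SVM",): 5,
    ("RF", "SVM"): 11,
    ("PLS-DA", "RF"): 6,
    ("PLS-DA", "RF", "SVM"): 4,
}


def model_size(model):
    """Total number of variables selected by one model."""
    return sum(count for members, count in regions.items() if model in members)


sizes = {model: model_size(model) for model in ("PLS-DA", "RF", "SVM")}
selected_by_any = sum(regions.values())
```

The totals reproduce the reported model sizes, and the union of the three selections comes to 47 of the 83 candidate variables.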

Figure 3.

Potential risk factors for chronic total occlusion (CTO) in three machine learning models. The Venn diagram (a) and UpSet plot (b) identified four potential risk factors for CTO that were shared among all models.

Model performance and validation

As illustrated in Figure 4, the number of variables was chosen according to the highest ROC value for each model. For the PLS-DA model, the selection of 10 variables resulted in an internally validated ROC (95% CI) value of 0.706 (0.659–0.749) (Figure 4a). For the RF model, the selection of 42 variables resulted in an internally validated ROC (95% CI) value of 0.702 (0.653–0.749) (Figure 4b). For the SVM model, the selection of 20 variables resulted in an internally validated ROC (95% CI) value of 0.696 (0.655–0.741) (Figure 4c).

Figure 4.

Receiver operating characteristic curves of (a) partial least squares discriminant analysis (PLS-DA), (b) random forest (RF), and (c) support vector machine (SVM) models for chronic total occlusion (CTO) prediction in the internal validation set. AUC, area under the curve; CI, confidence interval.

Comparison of ROC curves among models

Model performance was further characterized using DeLong’s test to compare the areas under the ROC curves in the internal validation cohort (Figure 5). This assessment revealed no significant differences among the three models (P = 0.97 for PLS-DA vs. RF; P = 0.80 for PLS-DA vs. SVM; P = 0.82 for RF vs. SVM).
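For reference, DeLong's test for two correlated AUCs can be sketched with the standard library alone. This is an illustrative implementation of the published method, not the software used in the study; the function names and the toy scores are hypothetical. It assumes both models were scored on the same patients and that each class has at least two members.

```python
import math


def _components(pos, neg):
    """DeLong structural components for one model's scores."""
    def psi(a, b):
        return 1.0 if a > b else 0.5 if a == b else 0.0
    v10 = [sum(psi(p, n) for n in neg) / len(neg) for p in pos]
    v01 = [sum(psi(p, n) for p in pos) / len(pos) for n in neg]
    return v10, v01


def _cov(a, b):
    """Sample covariance of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


def _split(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return pos, neg


def delong_test(labels, scores_a, scores_b):
    """Return (auc_a, auc_b, z, two-sided p) for two models on the same patients."""
    v10_a, v01_a = _components(*_split(scores_a, labels))
    v10_b, v01_b = _components(*_split(scores_b, labels))
    m, n = len(v10_a), len(v01_a)
    auc_a, auc_b = sum(v10_a) / m, sum(v10_b) / m
    # Variance of the AUC difference from the component covariances.
    var = ((_cov(v10_a, v10_a) + _cov(v10_b, v10_b) - 2 * _cov(v10_a, v10_b)) / m
           + (_cov(v01_a, v01_a) + _cov(v01_b, v01_b) - 2 * _cov(v01_a, v01_b)) / n)
    if var <= 0:  # identical rankings: no evidence of a difference
        return auc_a, auc_b, 0.0, 1.0
    z = (auc_a - auc_b) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc_a, auc_b, z, p_value


# Toy example: 4 positives and 4 negatives scored by two hypothetical models.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.4, 0.7, 0.3, 0.6, 0.2, 0.1]
scores_b = [0.8, 0.3, 0.6, 0.9, 0.4, 0.5, 0.1, 0.7]
auc_a, auc_b, z_stat, p_val = delong_test(labels, scores_a, scores_b)
```

Because the test uses the covariance between the two models' scores on the same patients, it is more appropriate for paired comparisons than a test that treats the two AUCs as independent.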

Figure 5.

Comparison of area under the receiver operating characteristic curve values among partial least squares discriminant analysis (PLS-DA), random forest (RF), and support vector machine (SVM) models.

Discussion

In this study, we used three machine learning models to predict CTO development on the basis of patients’ clinical and demographic characteristics, as well as their biochemical parameters and echocardiography findings. The PLS-DA model had the numerically highest ROC value (0.706), although DeLong’s test showed no significant differences among the three models. Furthermore, four variables (sex, NE%, creatinine, and BNP) were present in all three models; they may comprise major risk factors for CTO development. These variables can easily be measured, and they may help clinicians to identify patients with higher CTO risk, thereby facilitating earlier intervention and improving management.

The emergence of computational methods based on machine learning algorithms has enabled the development of predictive models that use large datasets with multiple variables.⁹ Each machine learning algorithm has unique strengths and limitations.¹⁰ PLS-DA, RF, and SVM are among the most commonly used machine learning algorithms.¹¹ In the present study, we compared the CTO prediction performances of these algorithms.

PLS-DA is an algorithm commonly used for supervised classification tasks.¹² In PLS-DA, categorical variables are regarded as dependent variables to explore linear relationships with the independent variables. This method is particularly useful for datasets with high-dimensional variables and limited sample sizes because it can extract meaningful information from a large number of variables and identify the most important variables for classification.¹³

RF, a classification algorithm based on decision trees,¹⁴ functions by generating multiple decision trees that use randomly selected subsets of characteristics and observations. This method is suitable for tabular data containing both continuous and categorical variables; it can rapidly create an optimal prediction or classification model.¹⁵ A key advantage of RF is its ability to reduce overfitting by extracting a subset of data via sampling, then generating decision trees that are independent of each other and based on small numbers of variables.¹⁶ This approach reduces the cost function; it improves prediction results by majority voting and evaluating the importance of independent variables. In an RF model, each decision tree represents a class prediction; the class choice made by the greatest proportion of trees reflects the model’s prediction.¹⁷

SVM, a common supervised classification algorithm, can be linear or non-linear.¹⁸ It is designed to identify optimal decision boundaries between classes by transforming input variables into a high-dimensional space, where the optimal boundary can be obtained via maximum margin criteria.¹⁹ When classes cannot be separated in a linear manner, the algorithm uses a kernel function to map the variables into a higher dimensional space. The hyperplane in this high-dimensional space becomes the decision boundary in the original variable space.²⁰
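The idea of mapping variables into a higher-dimensional space can be made concrete with a toy example. The four points below follow an XOR pattern and cannot be separated by a straight line in the original two variables, but adding the product term x1·x2 makes them separable by a plane. This is a hand-built illustration with weights chosen by inspection, not a fitted maximum-margin SVM; all names are hypothetical.

```python
def feature_map(x1, x2):
    """Explicit map into three dimensions; the product term is the added axis."""
    return (x1, x2, x1 * x2)


# XOR-like toy data: ((x1, x2), class label).
points = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]

# Hyperplane w . phi(x) + b = 0 in the mapped space (chosen by inspection).
WEIGHTS = (0.0, 0.0, 1.0)
BIAS = 0.0


def predict(x1, x2):
    """Classify by the side of the hyperplane on which phi(x) falls."""
    phi = feature_map(x1, x2)
    score = sum(w * f for w, f in zip(WEIGHTS, phi)) + BIAS
    return 1 if score > 0 else -1


predictions = [predict(*xy) for xy, _ in points]
```

A kernel function achieves the same effect without computing the mapped coordinates explicitly: it returns the inner product of two mapped points directly, which is what makes very high-dimensional mappings computationally feasible.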

We trained the above machine learning models based on input variables including clinical and demographic characteristics, biochemical parameters, and echocardiography findings. Generally, model error decreases as the number of included variables increases, but excessive inclusion of variables does not provide any practical benefit in clinical practice.²¹ To optimize model efficiency, we identified the number of variables that yielded the highest ROC value for each algorithm. As shown in Figure 1, use of the top 10, 42, and 20 important variables in the PLS-DA, RF, and SVM models, respectively, led to better performance compared with the use of all variables in the corresponding models. Among the three models, PLS-DA had the best CTO prediction ability and required the fewest variables, leading to greater convenience in clinical settings.²²

This study was performed to identify key risk factors associated with coronary artery lesion progression to a CTO, a phenomenon that remains poorly understood despite the known associations of traditional CAD risk factors with disease occurrence.²³ Previous research has mainly focused on stable CAD or acute coronary syndrome²⁴; there has been limited information regarding CTO. To determine variable importances, the Gini criterion was used to consider all possible variable combinations in each machine learning model²⁵; this approach identified sex, NE%, creatinine, and BNP as potential critical risk factors for CTO development. Considering that the recanalization of a completely occluded vessel requires substantial time and resources,²⁶ there is a need to prevent CTO development in patients with CAD.

Overall, this study explored novel risk factors for progression from CAD to CTO; it also developed prediction models based on machine learning algorithms. Analyses of clinical and demographic characteristics, biochemical parameters, and echocardiography findings showed that sex, NE%, creatinine, and BNP were critical risk factors for CTO development. The best prediction model was PLS-DA, which achieved an ROC value of 0.706 in the internal validation set. These findings highlight potential applications of machine learning models in understanding CTO risk factors; such models offer a practical tool for diagnostic prediction before CAG.

This study had a few limitations. First, it was conducted at a single center and lacked an external validation cohort, which might reduce the generalizability of the machine learning models. Second, no physiological or coronary computed tomography angiography assessments were conducted prior to CAG. Finally, the study only included patients with unstable angina pectoris, which may have led to information bias because it permitted higher risk of outcome misclassification.

Conclusions

In this study, we developed and validated three machine learning prediction models for CTO in patients with unstable angina pectoris using demographic characteristics, biochemical parameters, and echocardiography findings. The results showed that the PLS-DA model had the best prediction performance and required the smallest number of variables. Furthermore, sex, NE%, creatinine, and BNP were identified as potential risk factors for progression from CAD to CTO. Despite the limitations of the study, our findings provide useful insights for future research. Further studies with larger patient samples and other relevant variables are needed to improve the accuracies of our prediction models, thereby facilitating early diagnosis of CTO in patients with unstable angina pectoris.

Footnotes

Acknowledgement

We thank Jesse Luo for assistance with manuscript preparation.

Author contributions

YS and JL conceived the study and designed the protocol. YS, YL, and ZC analyzed the data and wrote the manuscript. YS and WJ were responsible for study selection, data extraction, and evaluation of study quality. JL critically reviewed the manuscript. All authors read and approved the final manuscript.

Availability of data and materials

The datasets used during the study are available from the corresponding author on reasonable request.

Declaration of conflicting interests

The authors declare that there is no conflict of interest.

Funding

This work was supported by the National Natural Science Fund of China (Nos. 82200441, 81970291, and 82170344), the Beijing Hospitals Authority Youth Programme (No. QML20230607), the Young Elite Scientists Sponsorship Program by BAST (No. BYESS2023238), and the Major State Basic Research Development Program of China (973 Program, No. 2015CB554404).

ORCID iD

Yuchen Shi

References

1. Ybarra, Rinfret, Brilakis, et al. Definitions and Clinical Trial Design Principles for Coronary Artery Chronic Total Occlusion Therapies: CTO-ARC Consensus Recommendations. Circulation 2021; 143: 479–500.

2. Azzalini, Karmpaliotis, Santiago, et al. Contemporary issues in chronic total occlusion percutaneous coronary intervention. JACC Cardiovasc Interv 2022; 15: 1–21.

3. Assali, Buda, Megaly, et al. Update on chronic total occlusion percutaneous coronary intervention. Prog Cardiovasc Dis 2021; 69: 27–34.

4. Panteris, Deda, Papazoglou, et al. Machine learning algorithm to predict obstructive coronary artery disease: insights from the CorLipid trial. Metabolites 2022; 12: 816.

5. Zhu, Zheng, Gao, et al. The correlation between lipoprotein(a) elevations and the risk of recurrent cardiovascular events in CAD patients with different LDL-C levels. BMC Cardiovasc Disord 2022; 22: 171.

6. Shi, Zheng, Liu, et al. Leveraging machine learning techniques to forecast chronic total occlusion before coronary angiography. J Clin Med 2022; 11: 6993.

7. Galassi, Werner, Boukhris, et al. Percutaneous recanalisation of chronic total occlusions: 2019 consensus document from the EuroCTO Club. EuroIntervention 2019; 15: 198–208.

8. Collins, Reitsma, Altman, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015; 350: g7594.

9. Liu, Ding, Shen, et al. Automatic assessment of collateral physiology in chronic total occlusions by means of artificial intelligence. Cardiol J 2022. doi: 10.5603/CJ.a2022.0089. [Epub ahead of print]

10. Doupe, Faghmous, Basu. Machine learning for health services researchers. Value Health 2019; 22: 808–815.

11. Ghosh, Zhang, Ghosh, et al. Predictive modeling for metabolomics data. Methods Mol Biol 2020; 2104: 313–336.

12. Gromski, Muhamadali, Ellis, et al. A tutorial review: metabolomics and partial least squares-discriminant analysis–a marriage of convenience or a shotgun wedding. Anal Chim Acta 2015; 879: 10–23.

13. Lee, Liong, Jemain AA. Partial least squares-discriminant analysis (PLS-DA) for classification of high-dimensional (HD) data: a review of contemporary practice strategies and knowledge gaps. Analyst 2018; 143: 3526–3539.

14. Jiang, Zhi, et al. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol 2017; 2: 230–243.

15. Blanchet, Vitale, Van Vorstenbosch, et al. Constructing bi-plots for random forest: tutorial. Anal Chim Acta 2020; 1131: 146–155.

16. Chen, Jiang, Huang, et al. Identification of energy metabolism-related biomarkers for risk prediction of heart failure patients using random forest algorithm. Front Cardiovasc Med 2022; 9: 993142.

17. Ambale-Venkatesh, Yang, et al. Cardiovascular event prediction by machine learning: the multi-ethnic study of atherosclerosis. Circ Res 2017; 121: 1092–1101.

18. Noble WS. What is a support vector machine? Nat Biotechnol 2006; 24: 1565–1567.

19. Forghani, Yazdi HS. Robust support vector machine-trained fuzzy system. Neural Netw 2014; 50: 154–165.

20. Nedaie, Najafi AA. Support vector machine with Dirichlet feature mapping. Neural Netw 2018; 98: 87–101.

21. Habehh, Gohel. Machine learning in healthcare. Curr Genomics 2021; 22: 291–300.

22. Esteva, Robicquet, Ramsundar, et al. A guide to deep learning in healthcare. Nat Med 2019; 25: 24–29.

23. Zeng, Jian, et al. The association between serum total bile acid level and long-term prognosis in patients with coronary chronic total occlusion undergoing percutaneous coronary intervention. Dis Markers 2022; 2022: 1434111.

24. Hoefer, Steffens, Ala-Korpela, et al. Novel methodologies for biomarker discovery in atherosclerosis. Eur Heart J 2015; 36: 2635–2642.

25. Nembrini, König, Wright MN. The revival of the Gini importance? Bioinformatics 2018; 34: 3711–3718.

26. Xenogiannis, Alaswad, Krestyaninov, et al. Impact of adherence to the hybrid algorithm for initial crossing strategy selection in chronic total occlusion percutaneous coronary intervention. Rev Esp Cardiol (Engl Ed) 2021; 74: 1023–1031.