Sage Journals: Discover world-class research

Abstract

Objectives:

This study aims to introduce a machine learning-based prediction model as an efficient tool for improving prognosis and increasing CRC survival.

Methods:

In this retrospective study, we used the data of 1062 CRC cases to analyse and establish a prediction model for 5-year CRC survival. Eight machine learning algorithms were used to develop prediction models: Random Forest, XG-Boost, bagging, logistic regression, support vector machine, artificial neural network, decision tree, and K-nearest neighbours.

Results:

XG-Boost, with an AU-ROC of 0.906 in internal validation and 0.813 in external validation, showed better predictability and generalizability than the other algorithms.

Conclusion:

XG-Boost can be utilised as a knowledge source for intelligent systems that assist clinical decision-making in healthcare settings, helping doctors improve prognosis and increase CRC survival through appropriate clinical solutions.

Keywords

colorectal cancer, survival rate, prognostic factor, machine learning, prediction model

Highlights

• Machine learning algorithms were leveraged to establish prediction models for the 5-year survival of CRC.

• A combination of pathological, laboratory, therapy, socioeconomic, and lifestyle factors was used to predict 5-year CRC survival.

• XG-Boost is a satisfactory model for predicting the 5-year survival of CRC.

• Pathological and therapy factors were the most important predictors of 5-year CRC survival.

• This study demonstrated a favourable generalizability of the XG-Boost model in different clinical environments.

Introduction

Colorectal cancer (CRC) is the development and growth of tumour masses of abnormal cells in any region from the colon to the rectum.^1,2 With 1 400 000 new cases and roughly 700 000 deaths, it is the third most prevalent cancer and the fourth leading cause of cancer mortality worldwide.^3-5 A primary public health concern, CRC accounts for 135 430 new cases and 50 260 deaths in the USA, where it is the third most common cancer and the second leading cause of cancer death.^6,7 The disease is increasing in developing countries, especially those adopting a Western lifestyle.^8,9 While CRC incidence and mortality are rising in low- and middle-income countries, they are stable or declining in high-income ones.^10,11

CRC is the third most common cancer in Iran; it is the fourth most prevalent cancer in men, after stomach, bladder, and prostate cancer, and the second in women, following breast cancer.^12,13 This disease constitutes 10% and 9.4% of cancer incidence and death, respectively, and is projected to reach 3.2 million cases globally by 2040.^14,15 This sharp escalation is attributed to the growing population of older adults and to human development.^15,16 CRC incidence has shown an ascending trend in Iran,^12,17 and GLOBOCAN projects that it will double there before 2040.^18,19

One way to evaluate the efficiency of healthcare measures for disease control and the effects of various treatment plans is to estimate CRC survival.²⁰ The 5-year CRC survival rate varies widely between nations, from less than 8% in African countries to 64%-65% in South Korea and the USA. Within Iran, it ranges from 27% to 85% across different regions.^9,21,22 The 5-year CRC survival has increased from 37.9% in 1998-2002 to 78.6% in recent years owing to early detection of the disease while the tumour is still localised.²³ However, current CRC prognosis relies on the classic Tumor, Node, and Metastasis staging classification system, so a more accurate and efficient prediction model is required to provide better insight into prognosis and increase treatment efficiency.²⁴

So far, machine learning (ML) techniques have proved capable of establishing efficient prediction models with high accuracy in various areas of medicine.^25,26 They have shown more potential for prediction than other methods, such as deep learning, especially when dealing with routine clinical data rather than images, signals, or videos.²⁷ They have also demonstrated more efficient and accurate survival prediction in cancer, and they have been introduced as a convenient alternative to some conventional statistical methods, such as Cox regression, for prediction purposes even without prior knowledge of the data.²⁸ A prediction model based on prognostic factors can therefore play a significant role in supporting clinical solutions, such as treatment strategies, by predicting the prognosis of this disease at an earlier stage. Accordingly, the current study aims to establish an ML-based prediction model for 5-year CRC survival from prognostic factors, in order to improve prognosis and clinical efficiency and thereby increase CRC survival among patients.

Methods

Community of study and database characteristics

In this data-driven, retrospective study, the study population comprised confirmed CRC cases referred to Masoud Clinical Center in Tehran City from January 2018 to December 2023. Over these 6 years of referrals, the data of 1062 CRC cases were stored in one database. Of these, 661 were non-survived and 401 were survived instances. The survived cases were patients who had a CRC-positive diagnosis in their records at that centre and were alive 5 years after the primary diagnosis. The non-survived cases had similar characteristics but died within 5 years of the primary diagnosis.

Database preprocessing

We first prepared the database to establish the optimal prediction model for CRC survival. First, duplicate and noisy records were removed. Second, we handled missing values under two scenarios. (1) Cases with more than 5% missing values were excluded from the study to prevent bias in further analysis. This threshold was chosen for two reasons: the database was almost complete, so the threshold did not lead to a high rate of case omission, and with so few cases omitted, the characteristics of the original database were largely preserved. (2) For cases with less than 5% missing values, each missing value was filled with the mode of the corresponding feature.
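The preprocessing rules above can be sketched in a few lines of Python. This is a minimal illustration only: the record layout, the `patient_id` key, the use of `None` as the missing-value marker, the generic feature names, and the decision to treat exactly 5% missingness as imputable are all assumptions of the sketch, not details reported in the study.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing value


def preprocess(records, features, id_key="patient_id", max_missing=0.05):
    """Deduplicate, drop incomplete cases, and mode-impute the rest."""
    # 1) Keep only the first record seen for each patient ID.
    seen, unique = set(), []
    for rec in records:
        if rec[id_key] not in seen:
            seen.add(rec[id_key])
            unique.append(rec)

    # 2) Exclude cases whose share of missing features exceeds the threshold.
    def missing_share(rec):
        return sum(rec[f] is MISSING for f in features) / len(features)

    kept = [dict(rec) for rec in unique if missing_share(rec) <= max_missing]

    # 3) Fill the remaining gaps with the mode of each feature.
    for f in features:
        observed = [rec[f] for rec in kept if rec[f] is not MISSING]
        if not observed:
            continue
        mode = Counter(observed).most_common(1)[0][0]
        for rec in kept:
            if rec[f] is MISSING:
                rec[f] = mode
    return kept


# Hypothetical toy data: 20 generic features, every observed value "yes".
features = [f"x{i}" for i in range(20)]


def make_case(pid, missing=()):
    rec = {"patient_id": pid, **{f: "yes" for f in features}}
    for f in missing:
        rec[f] = MISSING
    return rec


records = [
    make_case(1),
    make_case(1),                        # duplicate ID, dropped
    make_case(2, missing=["x0"]),        # 1 of 20 missing (5%), imputed
    make_case(3, missing=["x0", "x1"]),  # 2 of 20 missing (10%), excluded
    make_case(4),
]
clean = preprocess(records, features)
print([rec["patient_id"] for rec in clean])  # [1, 2, 4]
```

Mode imputation suits this database because the prognostic factors are recorded as categories; a mean or median would be the usual substitute for continuous features.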

Input and outcome variables

The prognostic factors in the database were categorised as follows: demographic factors (age, sex, place of residence, and Body Mass Index (BMI)); lifestyle factors (smoking and alcohol consumption); personal history of diabetes and hypertension; familial history of CRC; therapy factors (surgery, chemotherapy, radiotherapy, and hormonotherapy); pathological factors (tumor stage, tumor recurrence, tumor differentiation, lymphovascular invasion, perineural invasion, and tumor location); and laboratory factors (hemoglobin and white blood cell (WBC) count). The outcome variable was the 5-year survival status of CRC patients, categorised as survived or non-survived and coded as 0 and 1, respectively, in the database.

Feature selection

Before performing the ML process, we applied feature selection (FS). FS eliminates irrelevant features to clean the database and increase its quality.²⁹ It helps prevent overfitting of ML algorithms, speeds up calculations, increases the accuracy of algorithms, improves understanding of the data, and improves generalisation.³⁰ In the current study, we used two FS methods to obtain the best prognostic factors influencing 5-year CRC survival: binary logistic regression (BLR), as a correlation analysis, and weighting by Gini Index (GI) score. For BLR, factors with P < .05 were selected. For GI, we set GI = 0.5 as the threshold: factors with GI ⩽ 0.5 (more robust classification of cases by the variable) were considered influential in predicting 5-year CRC survival, whereas factors with GI > 0.5 (weaker classification of cases by the variable) were excluded from the data analysis and ML model building.

Model establishment and assessment

We used eight ML algorithms to build the prediction models for 5-year CRC survival: Random Forest (RF), XG-Boost, bagging, logistic regression (LR), Support Vector Machine (SVM), Artificial Neural Network (ANN), Decision Tree (DT), and K-nearest Neighbors (KNN). These algorithms were chosen for their high applicability and favourable predictive performance, as demonstrated in other biomedical research.^31,32 Their performance was evaluated using the positive predictive value (PPV) (equation (1)), negative predictive value (NPV) (equation (2)), sensitivity (equation (3)), specificity (equation (4)), accuracy (equation (5)), F-Score (equation (6)), and the Area Under the Receiver Operating Characteristic curve (AU-ROC). In the equations, TP and TN refer to non-survived and survived cases, respectively, that are correctly categorised by the ML algorithms; FN and FP are non-survived and survived cases, respectively, that are incorrectly classified.

The Grid search method was used to tune the ML algorithms’ hyperparameters during the learning process. In this method, each algorithm is evaluated over a predefined range of values for each hyperparameter, and the combination of values yielding the highest performance is chosen. We used K-fold cross-validation as the data-splitting strategy for the learning process. In this strategy, the data are split into K sections: one section tests the algorithm’s performance, and the remaining (K-1) sections are used for training. This process is repeated K times, with a different section held out for testing each time, and the algorithm’s performance is the average over the K iterations. Stratified K-fold cross-validation is more common when the classes are imbalanced: the data are sampled randomly according to the class distribution, which is crucial in reducing the bias created during the learning process.³³ In the current study, K = 10 was used, as in many other studies,³⁴ so stratified 10-fold cross-validation was used for training and testing.
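To make the data-splitting strategy concrete, the following is a minimal pure-Python sketch of stratified K-fold splitting. The function name, the round-robin assignment, and the random seed are assumptions of this sketch; the study does not report which software performed the split. The class sizes are those of the cleaned database (648 non-survived, 395 survived).

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's cases round-robin so every fold gets a fair share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, test


# 1 = non-survived, 0 = survived, following the coding used in the database.
labels = [1] * 648 + [0] * 395
splits = list(stratified_k_fold(labels, k=10))

for train, test in splits:
    non_survived = sum(labels[i] for i in test)
    print(len(train), len(test), non_survived)
```

Each test fold holds 64 or 65 non-survived cases and 39 or 40 survived cases, so the roughly 62:38 class ratio of the full database is kept in every fold. In a grid search, each hyperparameter combination would be scored by averaging a metric over these ten folds.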

PPV = \frac{TP}{TP + FP}

(1)

NPV = \frac{TN}{TN + FN}

(2)

Sensitivity = \frac{TP}{TP + FN}

(3)

Specificity = \frac{TN}{TN + FP}

(4)

Accuracy = \frac{TP + TN}{TP + FN + FP + TN}

(5)

F\text{-}Score = \frac{TP}{TP + \frac{1}{2}(FN + FP)}

(6)
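As a worked example of the six metrics defined above, the short sketch below evaluates them for one confusion matrix. The function name is an assumption of this sketch; the counts are the external-validation counts reported later in Table 5 for BLR-XG-Boost (TP = 58, FN = 10, FP = 7, TN = 33).

```python
def classification_metrics(tp, fn, fp, tn):
    """Return PPV, NPV, sensitivity, specificity, accuracy, and F-score."""
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "f_score": tp / (tp + 0.5 * (fn + fp)),
    }


# External-validation counts reported for BLR-XG-Boost in Table 5.
metrics = classification_metrics(tp=58, fn=10, fp=7, tn=33)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

For these counts the sketch gives a PPV of about 89.2%, an NPV of about 76.7%, a sensitivity of about 85.3%, a specificity of 82.5%, an accuracy of about 84.3%, and an F-score of about 87.2%.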

External validation assessment

We used external validation to measure the generalizability and applicability of our prediction model in other clinical settings. For this purpose, we used 108 CRC cases from the Imam Khomeini Hospital in Sari City, of which 68 were non-survived and 40 were survived instances. To assess the generalizability of the algorithm, we used the classification counts TP, FP, FN, and TN, as well as the AU-ROC, in the internal and external modes. Moreover, we assessed the importance of the prognostic factors in both modes using the relative importance (RI) obtained by the ML algorithm.

Results

Database preparation and sample characteristics

First, eight duplicate cases (records belonging to one patient with the same ID) were removed from the study: two survived and six non-survived cases. Second, 11 cases with more than 5% missing data (4 survived and 7 non-survived) were excluded. The missing data of 31 cases with less than 5% missing data were filled with the mode of the corresponding feature. Finally, 1043 CRC cases were included in the study, of which 648 were non-survived and 395 were survived cases. The statistical characteristics of the cases included in the data analysis are presented in Table 1. Differences between the survived and non-survived groups were assessed with the Chi-square test (significance at P < .05).
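The case counts above can be checked with simple arithmetic. The sketch below only restates the numbers reported in the Methods and in this paragraph; the variable names are illustrative.

```python
# Cases initially stored in the database (Methods).
initial = {"non_survived": 661, "survived": 401}
# Cases removed as duplicates and for more than 5% missing data (Results).
duplicates = {"non_survived": 6, "survived": 2}
over_5pct_missing = {"non_survived": 7, "survived": 4}

final = {
    group: initial[group] - duplicates[group] - over_5pct_missing[group]
    for group in initial
}
print(final)                # {'non_survived': 648, 'survived': 395}
print(sum(final.values()))  # 1043
```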

Table 1.

Characteristics of cases included in the analysis.

Features	Values	Total cases n = 1043	Non-survived cases n = 648	Survived cases n = 395	P-value
Age (years)	<55	379	247	132	<0.001*
Age (years)	>55	664	401	263	<0.001*
Sex	Male	496	333	163	0.1
Sex	Female	547	315	232	0.1
Place of residence	Rural	646	397	249	0.08
Place of residence	Urban	397	251	146	0.08
BMI (kg/m²)*	<18.5	334	254	80	<0.001*
BMI (kg/m²)*	18.5-25	423	333	90	<0.001*
BMI (kg/m²)*	25-30	212	43	169	<0.001*
BMI (kg/m²)*	>30	74	18	56	<0.001*
Smoking	Yes	548	416	132	<0.001*
Smoking	No	495	232	263	<0.001*
Alcohol consumption	Yes	227	167	60	0.1
Alcohol consumption	No	816	481	335	0.1
Diabetes	Yes	623	503	120	<0.001*
Diabetes	No	420	145	275	<0.001*
Hypertension	Yes	479	329	150	<0.001*
Hypertension	No	564	319	245	<0.001*
Familial history of CRC	Yes	448	309	139	<0.001*
Familial history of CRC	No	595	339	256	<0.001*
Surgery	Yes	876	540	336	<0.001*
Surgery	No	167	108	59	<0.001*
Chemotherapy	Yes	799	490	309	<0.001*
Chemotherapy	No	244	158	86	<0.001*
Radiotherapy	Yes	622	370	252	<0.001*
Radiotherapy	No	421	278	144	<0.001*
Hormonotherapy	Yes	355	224	131	0.04*
Hormonotherapy	No	688	421	267	0.04*
Tumor stage	I	127	42	85	<0.001*
Tumor stage	II	496	258	238	<0.001*
Tumor stage	III	324	289	36	<0.001*
Tumor stage	IV	96	59	37	<0.001*
Tumor recurrence	Yes	325	248	77	<0.001*
Tumor recurrence	No	718	400	318	<0.001*
Tumor differentiation	Grade 1	397	198	199	<0.001*
Tumor differentiation	Grade 2	438	316	122	<0.001*
Tumor differentiation	Grade 3	208	134	74	<0.001*
Lymphovascular invasion	Yes	652	349	303	<0.001*
Lymphovascular invasion	No	391	299	92	<0.001*
Perineural invasion	Yes	489	317	172	<0.001*
Perineural invasion	No	554	331	223	<0.001*
Tumor location	Rectum	207	86	121	<0.001*
Tumor location	Right and Transverse colon	386	318	68	<0.001*
Tumor location	Left colon	397	210	187	<0.001*
Tumor location	Sigmoid	53	34	19	<0.001*
Hemoglobin level	Low	688	434	254	<0.001*
Hemoglobin level	Normal	355	214	141	<0.001*
WBC count (/µL**)	<55 000	691	394	297	<0.001*
WBC count (/µL**)	>55 000	352	254	98	<0.001*

*Kilogram/metre². **Per microliter.

According to Table 1, the prognostic factors age, BMI, smoking, diabetes, hypertension, familial history of CRC, surgery, chemotherapy, radiotherapy, hormonotherapy, tumor stage, tumor recurrence, tumor differentiation, lymphovascular invasion, perineural invasion, tumor location, hemoglobin level, and WBC count differed significantly between the survived and non-survived cases (P < .05). In contrast, sex, place of residence, and alcohol consumption did not differ significantly between the two groups.

Feature selection

The results of feature selection based on the BLR are presented in Table 2.

Table 2.

The BLR to screen important factors.

Features	β	OR	95% CI of OR	P-value
Age	.563	1.248	[1.113-1.684]	<.001
Sex	.106	1.017	[0.963-1.078]	.12
Place of residence	−.261	0.894	[0.745-1.284]	.2
BMI	−.548	0.693	[0.559-0.827]	<.001
Smoking	.447	1.224	[1.126-1.487]	<.001
Alcohol consumption	.321	1.106	[0.894-1.185]	0.1
Diabetes	.281	1.095	[1.07-1.113]	<.001
Hypertension	.119	1.025	[0.898-1.05]	0.1
Familial history of CRC	.552	1.288	[1.201-1.455]	<.001
Surgery	.761	1.453	[1.394-2.217]	<.001
Chemotherapy	.807	1.526	[1.438-2.543]	<.001
Radiotherapy	.61	1.399	[1.296-1.984]	<.001
Hormonotherapy	.221	1.069	[1.05-1.1]	<.001
Tumour stage	.746	1.427	[1.334-1.761]	<.001
Tumor recurrence	.597	1.303	[1.206-1.829]	<.001
Tumor differentiation	.664	1.474	[1.328-1.683]	<.001
lymphovascular invasion	.499	1.266	[1.201-1.384]	<.001
Perineural invasion	.315	1.102	[1.077-1.192]	.01
Tumor site	.689	1.494	[1.399-1.548]	<.001
Hemoglobin level	.577	1.295	[1.202-1.384]	<.001
WBC count	.621	1.374	[1.284-1.475]	<.001

In Table 2, β is the regression coefficient, OR is the odds ratio, and CI is the confidence interval. As this table shows, age (β = .563, OR = 1.248, 95% CI of OR = [1.113-1.684]), BMI (β = −.548, OR = 0.693, 95% CI of OR = [0.559-0.827]), smoking (β = .447, OR = 1.224, 95% CI of OR = [1.126-1.487]), diabetes (β = .281, OR = 1.095, 95% CI of OR = [1.07-1.113]), familial history of CRC (β = .552, OR = 1.288, 95% CI of OR = [1.201-1.455]), surgery (β = .761, OR = 1.453, 95% CI of OR = [1.394-2.217]), chemotherapy (β = .807, OR = 1.526, 95% CI of OR = [1.438-2.543]), radiotherapy (β = .61, OR = 1.399, 95% CI of OR = [1.296-1.984]), hormonotherapy (β = .221, OR = 1.069, 95% CI of OR = [1.05-1.1]), tumor stage (β = .746, OR = 1.427, 95% CI of OR = [1.334-1.761]), tumor recurrence (β = .597, OR = 1.303, 95% CI of OR = [1.206-1.829]), tumor differentiation (β = .664, OR = 1.474, 95% CI of OR = [1.328-1.683]), lymphovascular invasion (β = .499, OR = 1.266, 95% CI of OR = [1.201-1.384]), perineural invasion (β = .315, OR = 1.102, 95% CI of OR = [1.077-1.192]), tumor site (β = .689, OR = 1.494, 95% CI of OR = [1.399-1.548]), hemoglobin level (β = .577, OR = 1.295, 95% CI of OR = [1.202-1.384]), and WBC count (β = .621, OR = 1.374, 95% CI of OR = [1.284-1.475]) were significantly associated with 5-year CRC survival (P < .05). In contrast, sex, place of residence, alcohol consumption, and hypertension were not significant and were excluded from further analysis.

Figure 1 shows the importance of prognostic factors based on the GI scoring technique.

Figure 1.

Scoring the prognostic factors by using the GI score.

As Figure 1 shows, the factors including age (GI = 0.35), smoking (GI = 0.29), diabetes (GI = 0.47), familial history of CRC (GI = 0.37), surgery (GI = 0.25), chemotherapy (GI = 0.22), radiotherapy (GI = 0.29), hormonotherapy (GI = 0.33), tumor stage (GI = 0.17), tumor recurrence (GI = 0.26), tumor differentiation (GI = 0.24), lymphovascular invasion (GI = 0.18), perineural invasion (GI = 0.35), tumor location (GI = 0.26), hemoglobin level (GI = 0.33), and WBC count (GI = 0.39) were considered as the best factors for predicting the 5-year CRC survival by obtaining GI < 0.5. On the contrary, the factors including sex (GI = 0.64), place of residence (GI = 0.68), BMI (GI = 0.53), alcohol consumption (GI = 0.56), and hypertension (GI = 0.62) with GI > 0.5 were not considered as critical prognostic factors based on the GI.
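The two selection rules can be applied side by side to the reported values. In the sketch below, the P-values come from Table 2 and the GI scores from Figure 1; a P-value reported as "<.001" is stored as its upper bound 0.001, which is an assumption made only so that the comparison with .05 can be computed.

```python
P_THRESHOLD = 0.05
GI_THRESHOLD = 0.5

# factor: (BLR P-value or its upper bound, GI score)
factors = {
    "age": (0.001, 0.35),
    "sex": (0.12, 0.64),
    "place of residence": (0.2, 0.68),
    "BMI": (0.001, 0.53),
    "smoking": (0.001, 0.29),
    "alcohol consumption": (0.1, 0.56),
    "diabetes": (0.001, 0.47),
    "hypertension": (0.1, 0.62),
    "familial history of CRC": (0.001, 0.37),
    "surgery": (0.001, 0.25),
    "chemotherapy": (0.001, 0.22),
    "radiotherapy": (0.001, 0.29),
    "hormonotherapy": (0.001, 0.33),
    "tumor stage": (0.001, 0.17),
    "tumor recurrence": (0.001, 0.26),
    "tumor differentiation": (0.001, 0.24),
    "lymphovascular invasion": (0.001, 0.18),
    "perineural invasion": (0.01, 0.35),
    "tumor location": (0.001, 0.26),
    "hemoglobin level": (0.001, 0.33),
    "WBC count": (0.001, 0.39),
}

blr_selected = {name for name, (p, _) in factors.items() if p < P_THRESHOLD}
gi_selected = {name for name, (_, gi) in factors.items() if gi <= GI_THRESHOLD}

print(len(blr_selected), len(gi_selected))  # 17 16
print(blr_selected - gi_selected)           # {'BMI'}
```

The two methods agree on 16 factors; BMI is the only factor retained by BLR but rejected by the GI threshold.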

Model development and assessment

Table 3 presents the performance of the chosen algorithms, evaluated by 10-fold cross-validation with hyperparameters optimised by the Grid search method, under three conditions: all features, features selected by BLR, and features selected by GI. Table 4 lists the optimised hyperparameters.

Table 3.

The performance evaluation of chosen algorithms.

Algorithm	Feature selection	PPV (%)	NPV (%)	Sensitivity (%)	Specificity (%)	Accuracy (%)	F-Score (%)	AU-ROC
RF	BLR	84.69	70.16	80.25	76.2	78.72	82.41	0.811
	GI	84.16	70.81	81.17	74.94	78.81	82.64	0.783
	None	85.83	74.75	84.1	77.22	81.5	84.96	0.825
XG-Boost	BLR	95.93	91.34	94.6	93.42	94.15	95.26	0.892
	GI	95.03	95.51	97.38	91.65	95.21	96.19	0.906
	None	92.19	88.21	92.9	87.09	90.7	92.54	0.898
Bagging	BLR	88.36	84.99	91.36	80.25	87.15	89.83	0.865
	GI	89.46	85.75	91.67	82.28	88.11	90.55	0.883
	None	87.90	76.87	85.19	80.76	83.51	86.52	0.824
SVM	BLR	86.45	79.27	87.65	77.47	83.8	87.05	0.815
	GI	84.42	73.57	83.64	74.68	80.25	84.03	0.795
	None	82.12	68.61	80.09	71.39	76.8	81.09	0.763
ANN	BLR	78.36	62.93	76.54	65.32	72.29	77.44	0.724
	GI	77.82	58.71	71.45	66.58	69.61	74.5	0.701
	None	75.61	57.24	71.76	62.03	68.07	73.63	0.668
DT	BLR	83.67	65.71	75.93	75.7	75.84	79.61	0.726
	GI	79.02	62.15	75	67.34	72.1	76.96	0.714
	None	77.16	57.52	70.37	65.82	68.65	73.61	0.679
KNN	BLR	81.45	69.44	81.33	69.62	76.89	81.39	0.786
	GI	79.61	68.88	81.94	65.57	75.74	80.76	0.765
	None	78.95	67.46	81.02	64.56	74.78	79.97	0.744
LR	BLR	72.76	52.38	67.59	58.48	64.14	70.08	0.727
	GI	71.28	49.89	65.12	56.96	62.03	68.06	0.712
	None	69.45	47.26	62.81	54.68	59.73	65.96	0.694

Table 4.

The hyperparameters optimised by Grid search.

FS-Algorithm	Hyperparameters
BLR-RF	Maximum depth = 6, number of estimators = 12, maximum number of features = 5, maximum leaf nodes = 3
GI-RF	Maximum depth = 9, number of estimators = 15, maximum number of features = 9, maximum leaf nodes = 3
RF (none-FS)	Maximum depth = 11, number of estimators = 17, maximum number of features = 11, maximum leaf nodes = 3
BLR-XG-Boost	Booster = gradient boosted tree, eta = 0.15, minimum child weight = 1, maximum depth = 5
GI-XG-Boost	Booster = gradient boosted tree, eta = 0.25, minimum child weight = 1, maximum depth = 8
XG-Boost (None-FS)	Booster = gradient boosted tree, eta = 0.35, minimum child weight = 1, maximum depth = 12
BLR-Bagging	Base classifier = J-48, number of iterations = 15, calculate out of bag = false
GI-Bagging	Base classifier = J-48, number of iterations = 20, calculate out of bag = false
Bagging (None-FS)	Base classifier = REP-Tree, number of iterations = 30, calculate out of bag = true
BLR-SVM	Control parameter (C) = 10, kernel type = RBF, RBF_gamma = 0.15, gamma = 2
GI-SVM	Control parameter (C) = 10, kernel type = RBF, RBF_gamma = 0.25, gamma = 1.5
SVM (None-FS)	Control parameter (C) = 20, kernel type = Linear, gamma = 1
BLR-ANN	Hidden layers = 10, learning rate = 1, maximum epoch = 50
GI-ANN	Hidden layers = 8, learning rate = 0.7, maximum epoch = 30
ANN (None-FS)	Hidden layers = 20, learning rate = 0.5, maximum epoch = 200
BLR-DT	Confidence factor = 0.1, minimum number of object = 1, binary splitting = false, reduced error pruning = true
GI-DT	Confidence factor = 0.2, minimum number of object = 1, binary splitting = false, reduced error pruning = true
DT (None-FS)	Confidence factor = 0.35, minimum number of object = 1, binary splitting = false, reduced error pruning = true
BLR-KNN	3 < K < 9, distance computation = Euclidean metric, cross validate = true, distance weighting = 1/distance.
GI-KNN	3 < K < 9, distance computation = Euclidean metric, cross validate = true, distance weighting = 1/distance.
KNN (None-FS)	3 < K < 9, distance computation = Euclidean metric, cross validate = true, distance weighting = 1/distance.
BLR-LR	Maximum number of iterations = 20, number of decimal places = 4
GI-LR	Maximum number of iterations = 30, number of decimal places = 5
LR (None-FS)	Maximum number of iterations = 50, number of decimal places = 5

Based on Tables 3 and 4, using the best-optimised hyperparameters, we first compared the performance of each algorithm across the three FS conditions and then determined the best algorithm for prediction. RF performed best without FS, with PPV of 85.83%, NPV of 74.75%, sensitivity of 84.1%, specificity of 77.22%, accuracy of 81.5%, F-score of 84.96%, and AU-ROC of 0.825. XG-Boost achieved its best PPV (95.93%) and specificity (93.42%) with the BLR FS method, and its best NPV (95.51%), sensitivity (97.38%), accuracy (95.21%), F-score (96.19%), and AU-ROC (0.906) with the GI FS method.

Bagging performed best with the GI FS method, with PPV of 89.46%, NPV of 85.75%, sensitivity of 91.67%, specificity of 82.28%, accuracy of 88.11%, F-score of 90.55%, and AU-ROC of 0.883. SVM performed best with the BLR FS method, with PPV of 86.45%, NPV of 79.27%, sensitivity of 87.65%, specificity of 77.47%, accuracy of 83.8%, F-score of 87.05%, and AU-ROC of 0.815. ANN achieved its best PPV (78.36%), NPV (62.93%), sensitivity (76.54%), accuracy (72.29%), F-score (77.44%), and AU-ROC (0.724) with the BLR FS method, and its best specificity (66.58%) with the GI FS method. DT performed best with the BLR FS method, with PPV of 83.67%, NPV of 65.71%, sensitivity of 75.93%, specificity of 75.7%, accuracy of 75.84%, F-score of 79.61%, and AU-ROC of 0.726. KNN achieved its best PPV (81.45%), NPV (69.44%), specificity (69.62%), accuracy (76.89%), F-score (81.39%), and AU-ROC (0.786) with the BLR FS method, and its best sensitivity (81.94%) with the GI FS method.

Comparing the algorithms showed that XG-Boost, with PPV of 95.93%, NPV of 95.51%, sensitivity of 97.38%, specificity of 93.42%, accuracy of 95.21%, F-score of 96.19%, and AU-ROC of 0.906, had the highest performance of all ML algorithms for predicting 5-year CRC survival. In contrast, the ANN, with PPV of 75.61%, NPV of 57.24%, sensitivity of 71.45%, specificity of 62.03%, accuracy of 68.07%, and AU-ROC of 0.668, performed worse than the other algorithms apart from LR, and the DT, with an F-score of 73.61%, was the weakest of them on this criterion. LR had the lowest performance of all ML algorithms in every state: PPV of 72.76%, NPV of 52.38%, sensitivity of 67.59%, specificity of 58.48%, accuracy of 64.14%, F-Score of 70.08%, and AU-ROC of 0.727 in BLR-FS mode; PPV of 71.28%, NPV of 49.89%, sensitivity of 65.12%, specificity of 56.96%, accuracy of 62.03%, F-Score of 68.06%, and AU-ROC of 0.712 in GI mode; and PPV of 69.45%, NPV of 47.26%, sensitivity of 62.81%, specificity of 54.68%, accuracy of 59.73%, F-Score of 65.96%, and AU-ROC of 0.694 without FS.
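The comparison across algorithms and FS conditions can be reproduced for a single metric with a few lines of Python. The sketch below uses only the AU-ROC column transcribed from Table 3; the dictionary layout and variable names are illustrative.

```python
# AU-ROC values transcribed from Table 3, keyed by algorithm and FS condition.
au_roc = {
    "RF": {"BLR": 0.811, "GI": 0.783, "None": 0.825},
    "XG-Boost": {"BLR": 0.892, "GI": 0.906, "None": 0.898},
    "Bagging": {"BLR": 0.865, "GI": 0.883, "None": 0.824},
    "SVM": {"BLR": 0.815, "GI": 0.795, "None": 0.763},
    "ANN": {"BLR": 0.724, "GI": 0.701, "None": 0.668},
    "DT": {"BLR": 0.726, "GI": 0.714, "None": 0.679},
    "KNN": {"BLR": 0.786, "GI": 0.765, "None": 0.744},
    "LR": {"BLR": 0.727, "GI": 0.712, "None": 0.694},
}

# FS condition giving the highest AU-ROC for each algorithm.
best_per_algorithm = {
    algo: max(scores, key=scores.get) for algo, scores in au_roc.items()
}

# All (algorithm, FS condition, AU-ROC) triples, highest AU-ROC first.
ranking = sorted(
    (
        (algo, fs, value)
        for algo, scores in au_roc.items()
        for fs, value in scores.items()
    ),
    key=lambda item: item[2],
    reverse=True,
)

print(best_per_algorithm)
print(ranking[0])   # ('XG-Boost', 'GI', 0.906)
print(ranking[-1])  # ('ANN', 'None', 0.668)
```

On AU-ROC alone, GI-XG-Boost ranks first and ANN without FS ranks last. The other metrics in Table 3 can be compared the same way by swapping in the corresponding column.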

External validation test and feature assessment

As described in the Methods section, we tested the model’s ability to predict 5-year CRC survival on unseen data. We fed 108 external CRC cases (68 non-survived and 40 survived) to the best-performing model to assess its generalizability. The XG-Boost model was selected for this external validation. Table 5 presents the classification of the external cases in terms of TP, FP, FN, and TN for XG-Boost under three conditions: BLR as FS method, GI as FS method, and no FS.

Table 5.

The results of the external cases classification.

Algorithm	TP	FN	FP	TN
BLR-XG-Boost	58	10	7	33
GI-XG-Boost	55	13	13	27
XG-Boost (None-FS)	51	17	17	23

Based on Table 5, the XG-Boost model using BLR as FS method, with TP = 58, FN = 10, FP = 7, and TN = 33, obtained higher performance than the other conditions. XG-Boost with GI as FS ranked second, with TP = 55, FN = 13, FP = 13, and TN = 27. XG-Boost without FS performed worst, with TP = 51, FN = 17, FP = 17, and TN = 23, misclassifying the most cases. The model’s predictability based on the AU-ROC curve is compared in Figure 2.

Figure 2.

The XG-Boost in internal and external validations.

Figure 2 shows that XG-Boost with BLR as FS, with an AU-ROC of 0.813, obtained better predictability than the other conditions (its curve lies closest to the top of the sensitivity axis). XG-Boost with GI as FS ranked second with an AU-ROC of 0.787, and the model without any FS method had the worst predictability with an AU-ROC of 0.763. Comparing the internal and external conditions (the internal AU-ROC values of XG-Boost are presented in Table 3), the performance reduction of XG-Boost was almost 10% in AU-ROC, indicating favourable generalizability to other clinical environments. XG-Boost with BLR as FS method was therefore considered the best and most generalizable model for predicting 5-year CRC survival in the current study. We used this BLR-XG-Boost model to assess the importance of each prognostic factor, based on the Relative Importance (RI) gained by the model in the internal and external conditions. The results of scoring each prognostic factor based on the RI are shown in Figure 3.
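The size of the internal-to-external drop can be computed directly from the reported AU-ROC values. In the sketch below, the internal values are taken from Table 3 and the external values from the description of Figure 2; expressing the drop both in absolute AU-ROC points and relative to the internal value is a choice of this sketch, since the text does not state which definition underlies the "almost 10%" figure.

```python
# XG-Boost AU-ROC by FS condition.
internal = {"BLR": 0.892, "GI": 0.906, "None": 0.898}  # Table 3
external = {"BLR": 0.813, "GI": 0.787, "None": 0.763}  # Figure 2

drops = {}
for fs in internal:
    absolute = internal[fs] - external[fs]
    relative = absolute / internal[fs]
    drops[fs] = (absolute, relative)
    print(f"{fs}: absolute drop {absolute:.3f}, relative drop {relative:.1%}")
```

Under these definitions the BLR model loses about 0.079 AU-ROC points (roughly 8.9% of its internal value), the GI model about 0.119 (roughly 13.1%), and the model without FS about 0.135 (roughly 15.0%), so the BLR model shows the smallest loss.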

Figure 3.

The importance of prognostic factors in internal and external conditions.

According to Figure 3, almost all pathological factors obtained high importance. Among them, tumor differentiation (internal RI = 0.39, external RI = 0.41), tumor recurrence (internal RI = 0.44, external RI = 0.36), and lymphovascular invasion (internal RI = 0.43, external RI = 0.41) were more important than the other pathological factors for predicting 5-year survival. The therapy factors chemotherapy (internal RI = 0.48, external RI = 0.45) and surgery (internal RI = 0.45, external RI = 0.42) had greater predictive strength than the other prognostic factors. In contrast, age (internal RI = 0.15, external RI = 0.21) and BMI (internal RI = 0.18, external RI = 0.21) were less important in this respect.

Discussion

In the current study, we intended to establish a prediction model for 5-year CRC survival to support a better prognosis, especially for high-risk CRC patients whose poor prognosis affects their survival status. To this end, we applied an ML approach to a single-centre database of prognostic factors, aiming for effective and efficient predictability in various clinical situations. We first examined the database for redundant, noisy, and missing data to prepare it for analysis. Next, we used two feature selection strategies to choose the best factors influencing 5-year CRC survival. We then used the selected ML algorithms, namely RF, XG-Boost, bagging, SVM, ANN, DT, LR, and KNN, to establish prediction models for survival, and compared their performance to identify the most efficient predictor. Using the best-performing trained algorithm, we tested generalizability to other clinical environments on external cases. The prognostic factors were also assessed with the best-performing algorithm in internal and external validations.

We used two feature selection strategies, BLR and GI, as filtering methods to obtain the best factors influencing 5-year CRC survival. Based on the BLR, age, BMI, smoking, diabetes, familial history of CRC, surgery, chemotherapy, radiotherapy, hormonotherapy, tumor stage, tumor recurrence, tumor differentiation, lymphovascular invasion, perineural invasion, tumor location, hemoglobin level, and WBC count were identified as the essential factors. GI identified age, smoking, diabetes, familial history of CRC, surgery, chemotherapy, radiotherapy, hormonotherapy, tumor stage, tumor recurrence, tumor differentiation, lymphovascular invasion, perineural invasion, tumor location, hemoglobin level, and WBC count as the best prognostic factors.

The current study showed that XG-Boost, with PPV of 95.93%, NPV of 95.51%, sensitivity of 97.38%, specificity of 93.42%, accuracy of 95.21%, F-score of 96.19%, and AU-ROC of 0.906, performed better than the other algorithms. With an AU-ROC of 0.813 on the external cases, it also showed favourable generalizability in predicting 5-year CRC survival. Based on XG-Boost, tumor differentiation (internal RI = 0.39, external RI = 0.41), tumor recurrence (internal RI = 0.44, external RI = 0.36), lymphovascular invasion (internal RI = 0.43, external RI = 0.41), chemotherapy (internal RI = 0.48, external RI = 0.45), and surgery (internal RI = 0.45, external RI = 0.42) were the most important factors for 5-year CRC survival prediction. XG-Boost could therefore be considered an effective and efficient model for predicting 5-year CRC survival and be embedded as a knowledge base in intelligent systems, such as clinical decision support systems in clinical environments. Doctors in those settings could enter a CRC patient’s prognostic characteristics into the system and obtain the patient’s survival risk. Clinical solutions such as early identification of CRC recurrence and interventional and non-interventional therapies could then be applied to high-risk patients. In this way, the prognosis of these patients would be enhanced and, consequently, their survival increased.

So far, several studies have applied ML algorithms to CRC survival. Cardoso et al. used ML techniques to predict CRC patients’ survival. In their results, XG-Boost obtained the best performance, with an AU-ROC of 0.857, and the clinical stage was recognised as the best predictor of CRC survival.³⁵ Unlike Cardoso’s study, which relied more on treatment factors, the current study focused on both pathological and treatment factors, and the pathological factors showed higher predictive value than the other factors. We also tested XG-Boost on external cases, which showed favourable generalizability of the model across clinical settings; this was not considered in Cardoso’s study.

Yang et al. established ML models to predict CRC survival using multi-omics data from The Cancer Genome Atlas (TCGA). They applied bioinformatics analysis to the omics data and then trained ML algorithms on these data. Their best-performing model obtained an AU-ROC of 0.755 with 10-fold cross-validation.³⁶ In the current study, leveraging other factors, including treatment, pathological, laboratory, and demographic factors, gave better ML performance for CRC survival: XG-Boost reached an AU-ROC of 0.813 even under external validation. Bibault et al. built a prediction model for CRC survival based on the gradient-boosting algorithm using tumour characteristics, socioeconomics, and lifestyle factors. In their study, the model obtained an AU-ROC of 0.84 for prediction.³⁷ In the current study, we focused more on pathological factors and concluded that they significantly enhance predictability. Our results also indicated that the XG-Boost model, with AU-ROC values of 0.906 and 0.813 for internal and external validation, performed more effectively in predicting 5-year CRC survival and was usable in other clinical environments.

Achilonu et al. used ML and statistical approaches to predict CRC recurrence and patient survival. They showed that the ANN, with an AU-ROC of 0.82, predicted survival better than the other ML approaches, and they identified factors including histology as the most influential for survival.³⁸ In the current study, as in Achilonu’s study, pathological factors were more important than other factors, especially sociodemographic ones. Unlike Achilonu’s study, we used more pathological and therapy factors, and under these conditions our ML model, with an AU-ROC of 0.906, performed more favourably.

Pourhoseingholi et al. compared ensemble and non-ensemble ML techniques for predicting 5-year CRC survival, using prognostic factors that included tumour characteristics and therapy factors. Their study demonstrated that the voting algorithm, with an AU-ROC of 0.96, was the best-performing model for predicting survival.³⁹ In the current study, we focused more on pathological data and did not rely only on TNM data or on the performance obtained with internal cases. Hence, external data were used to test the prediction model’s generalizability and to demonstrate its clinical usability in other clinical environments. BalajiVicharapu’s study used pathological, laboratory, and lifestyle factors to build a prediction model for this topic, and RF, with an AU-ROC of nearly 0.82, performed better than the other ML techniques.⁴⁰ Treatment factors such as surgery, radiotherapy, and chemotherapy are crucial prognostic factors in this respect, but they were not considered in BalajiVicharapu’s study.

Unlike the previous studies, the current study combined prognostic factors, including pathological factors in addition to TNM data, laboratory data, therapy factors, socioeconomics, and lifestyle, to establish a prediction model for 5-year CRC survival. We also used external data to test the prediction model’s performance in other clinical environments, which was lacking in the previous studies. These testing scenarios showed that XG-Boost performs favourably in different environments, supporting the clinical usability of the prediction model in other clinical centres.
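To make the comparison of AU-ROC between internal and external data concrete, the sketch below computes AU-ROC in pure Python from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name, labels, and scores are hypothetical; this is not the evaluation code used in the study.

```python
def auc_roc(labels, scores):
    """AU-ROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AU-ROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive case ranked above negative case
            elif p == q:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities of 5-year survival from one fitted model.
internal_auc = auc_roc([0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80])
external_auc = auc_roc([0, 1, 0, 1, 1], [0.30, 0.30, 0.60, 0.70, 0.20])
```

In an external validation, the same fitted model is scored on both cohorts without refitting, and a smaller gap between the two values suggests better generalizability.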

Limitation

Despite the mentioned benefits, the current study had some limitations. First, we used the database of a single clinical centre, which may limit the generalizability of the current prediction model. Second, the preprocessing steps used to adapt the database for model development may affect the model’s accuracy and generalizability. Despite this limitation, the external performance of the current prediction model was favourable, indicating low bias in the prediction performance (almost a 10% reduction in AU-ROC, from 0.906 to 0.813). Third, the current study did not consider some factors, including genomic data and tumour markers, because this information was lacking in the database. Fourth, because of the retrospective nature of this study, some essential factors may not have been considered. Fifth, a critical step in estimating the bias of a prediction model is testing it on an external validation cohort, which allows the clinical usability and generalizability of the model to other clinical settings to be assessed. Although we used this method, the external sample was relatively small, which may not give a complete picture of the generalizability of the prediction model.
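The size of the drop from internal to external validation can be checked directly from the two reported AU-ROC values. The short Python sketch below computes the absolute and relative reduction; the helper name is hypothetical.

```python
def auc_reduction(internal_auc, external_auc):
    """Absolute and relative drop in AU-ROC from internal to external data."""
    absolute = internal_auc - external_auc
    relative = absolute / internal_auc
    return absolute, relative


# AU-ROC values reported for XG-Boost in this study.
absolute, relative = auc_reduction(0.906, 0.813)
print(f"absolute: {absolute:.3f}, relative: {relative:.1%}")
```

With the reported values, the absolute drop is about 0.093 and the relative drop is about 10%.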

Conclusion

An effective and efficient prediction of CRC survival can potentially enhance CRC patients’ prognosis. This study demonstrated that XG-Boost, with an AU-ROC of 0.906, predicts 5-year CRC survival well and may thereby help improve prognosis and increase CRC survival. The model can be leveraged as a knowledge source to establish an efficient prediction system to achieve this aim.

Future Directions

For future studies, we recommend using more data from several centres to improve predictability and generalizability, and using actual data rather than preprocessing steps to fill in missing values. Including factors such as genomic data and tumour markers may influence performance to some extent, so we recommend using these factors to further enhance survival predictability. We also suggest a cohort study design to investigate all aspects of the research and the other essential factors that should be included. Finally, leveraging more clinical data from more clinical centres for external validation is indispensable for assessing bias and for establishing the prediction model’s generalizability more confidently.

Footnotes

Acknowledgements

We thank the people and specialists who assisted us in all steps of this study.

Author Contribution

R.N. performed the writing, review, and editing of this manuscript.

Declaration Of Conflicting Interests:

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding:

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Availability of Data and Materials

The research data are available from the corresponding author upon reasonable request.

Ethics Approval and Consent to Participate

This study was approved by the Ethics Committee of Tehran University of Medical Sciences (TUMS) (No: IR.TUMS.SPH.REC.1398.191). Owing to the retrospective nature of this study, the requirement for informed consent was waived.

Consent for Publication

Not applicable

ORCID iD

Raoof Nopour

References

1. Ghani, Osman NMF. Knowledge and awareness of diabetes mellitus as a risk factor of colorectal cancer among International Islamic University Malaysia Kuantan students. Asian J Med Biomed. 2022;6(2):143-153.

2. Testa, Pelosi, Castelli. Colorectal cancer: genetic abnormalities, tumor progression, tumor heterogeneity, clonal evolution and tumor-initiating cells. Med Sci. 2018;6(2):31.

3. Eileen, Melina, Gini, et al. Global burden of colorectal cancer in 2020 and 2040: incidence and mortality estimates from GLOBOCAN. Gut. 2023;72(2):338. doi:10.1136/gutjnl-2022-327736

4. Gandomani, Aghajani, Mohammadian-Hafshejani, Tarazoj, Pouyesh, Salehiniya. Colorectal cancer in the world: incidence, mortality and risk factors. Biomed Res Therapy. 2017;4(10):1656-1675.

5. Koilakou, Petrou. Economic evaluation of monoclonal antibodies in metastatic colorectal cancer: a systematic review. Mol Diag Therapy. 2021;25:715-734.

6. Ansa, Coughlin, Alema-Mensah, Smith SA. Evaluation of colorectal cancer incidence trends in the United States (2000–2014). J Clin Med. 2018;7(2):22. doi:10.3390/jcm7020022

7. Schlottmann, Strassle, Cairns, Herbella, Fichera, Patti. Disparities in emergent colectomy for colorectal cancer contribute to inequalities in postoperative morbidity and mortality in the US health care system. Scand J Surg. 2020;109(2):102-107.

8. Rawla, Sunkara, Barsouk. Epidemiology of colorectal cancer: incidence, mortality, survival, and risk factors. Gastroenterol Rev/Przegląd Gastroenterol. 2019;14(2):89-103.

9. Keum, Giovannucci. Global burden of colorectal cancer: emerging trends, risk factors and prevention strategies. Nature Rev Gastroenterol & Hepatol. 2019;16(12):713-732.

10. Aguiar Jr, Oliveira, Mello, Calsavara, Curado MP. Survival of patients with colorectal cancer in a cancer center. Arq de Gastroenterol. 2020;57:172-177.

11. Arnold, Sierra, Laversanne, Soerjomataram, Jemal, Bray. Global patterns and trends in colorectal cancer incidence and mortality. Gut. 2017;66(4):683-691.

12. Dolatkhah, Somi, Kermani, et al. Increased colorectal cancer incidence in Iran: a systematic review and meta-analysis. BMC Public Health. 2015;15(1):997. doi:10.1186/s12889-015-2342-9

13. Amirkhah, Naderi-Meshkin, Mirahmadi, Allahyari, Sharifi HR. Cancer statistics in Iran: towards finding priority for prevention and treatment. Cancer Press Journal. 2017;3(2):27-38.

14. Basudan, Basuwdan, Abudawood, Farzan, Alfhili MA. Comprehensive retrospective analysis of colorectal cancer incidence patterns in Saudi Arabia. Life. 2023;13(11):2198.

15. Global colorectal cancer burden in 2020 and projections to 2040. Translat Oncol. 2021;14(10):101174.

16. Vabi, Gibbs, Parker GS. Implications of the growing incidence of global colorectal cancer. J Gastro Oncol. 2021;12(Suppl 2):S387.

17. Abdifard, Amini, Bab, Masroor, Khachian, Heidari. Incidence trends of colorectal cancer in Iran during 2000-2009: a population-based study. Med J Islam Rep Iran. 2016;30:382.

18. Hoseini, Rahmatinejad, Goshayeshi, et al. Colorectal cancer in North-Eastern Iran: a retrospective, comparative study of early-onset and late-onset cases based on data from the Iranian hereditary colorectal cancer registry. BMC Cancer. 2022;22(1):48. doi:10.1186/s12885-021-09132-5

19. Roshandel, Ghasemi-Kebria, Malekzadeh RJC. Colorectal cancer: epidemiology, risk factors, and prevention. Cancers. 2024;16(8):1530.

20. Qin, R-H, et al. Effect of fruquintinib vs placebo on overall survival in patients with previously treated metastatic colorectal cancer: the FRESCO randomized clinical trial. JAMA. 2018;319(24):2486-2496.

21. Maajani, Khodadost, Fattahi, et al. Survival rate of colorectal cancer in Iran: a systematic review and meta-analysis. Asian Pac J Cancer Prev. 2019;20(1):13-21. doi:10.31557/apjcp.2019.20.1.13

22. Jiang, Yuan, et al. Global pattern and trends of colorectal cancer survival: a systematic review of population-based registration data. Cancer Biol Med. 2022;19(2):175.

23. Dulskas, Gaizauskas, Kildusiene, Samalavicius, Smailyte. Improvement of survival over time for colorectal cancer patients: a population-based study. J Clin Med. 2020;9(12). doi:10.3390/jcm9124038

24. C-M, Yang Y-W, Lin J-K, et al. Modeling the survival of colorectal cancer patients based on colonoscopic features in a feature ensemble vision transformer. Comp Med Imag Graph. 2023;107:102242. doi:10.1016/j.compmedimag.2023.102242

25. Rahmani, Yousefpoor, et al. Machine learning (ML) in medicine: review, applications, and challenges. Mathematics. 2021;9(22):2970.

26. Handelman, Kok, Chandra, et al. eDoctor: machine learning and the future of medicine. J Int Med. 2018;284(6):603-619.

27. Bote-Curiel, Munoz-Romero, Gerrero-Curieses, Rojo-Álvarez JL. Deep learning and big data in healthcare: a double review for critical beginners. Appl Sci. 2019;9(11):2331.

28. Cruz, Wishart DS. Applications of machine learning in cancer prediction and prognosis. Cancer Inform. 2006;2:117693510600200030. doi:10.1177/117693510600200030

29. Venkatesh, Anuradha. A review of feature selection and its methods. Cyb Inform Technol. 2019;19(1):3-26.

30. Chandrashekar, Sahin. A survey on feature selection methods. Comp Elect Eng. 2014;40(1):16-28.

31. Nopour. Prediction of five-year survival among esophageal cancer patients using machine learning. Heliyon. 2023;9:1-15.

32. Nopour. Design of risk prediction model for esophageal cancer based on machine learning approach. Heliyon. 2024;10:1-11.

33. Yeh, Hsu, et al. Prediction of fatty liver disease using machine learning algorithms. Comp Meth Prog Biomed. 2019;170:23-29. doi:10.1016/j.cmpb.2018.12.032

34. Nopour. Establishment of prediction model for mortality risk of pancreatic cancer: a retrospective study. BMC Med Inform Dec Mak. 2024;24(1):181. doi:10.1186/s12911-024-02590-4

35. Buk Cardoso, Cunha Parro, Verzinhasse Peres, et al. Machine learning for predicting survival of colorectal cancer patients. Sci Rep. 2023;13(1):8874.

36. Yang, et al. A multi-omics machine learning framework in predicting the survival of colorectal cancer patients. Comp Biol Med. 2022;146:105516. doi:10.1016/j.compbiomed.2022.105516

37. Bibault J-E, Chang, Xing. Development and validation of a model to predict survival in colorectal cancer using a gradient-boosted machine. Gut. 2021;70(5):884-889.

38. Achilonu, Fabian, Bebington, et al. Predicting colorectal cancer recurrence and patient survival using supervised machine learning approach: a South African population-based study. Frontiers in Public Health. 2021;9:694306.

39. Pourhoseingholi, Kheirian, Zali MR. Comparison of basic and ensemble data mining methods in predicting 5-year survival of colorectal cancer patients. Acta Inform Med. 2017;25(4):254-258. doi:10.5455/aim.2017.25.254-258

40. BalajiVicharapu, Patnala SCM. A study on various machine learning techniques used for colorectal cancer disease prediction and survival. Annals of the Romanian Society for Cell Biology. 2020:748-763.

Development of Prediction Model for 5-year Survival of Colorectal Cancer
