Abstract
Background
The severity of coronavirus disease (COVID-19) is much higher in patients with chronic comorbidities than in other patients and can lead to death. Machine learning (ML) algorithms, as a potential solution for rapid and early clinical evaluation of disease severity, can help allocate and prioritize resources to reduce mortality.
Objective
The objective of this study was to predict the mortality risk and length of stay (LoS) of patients with COVID-19 and history of chronic comorbidities using ML algorithms.
Methods
This retrospective study was conducted by reviewing the medical records of COVID-19 patients with a history of chronic comorbidities from March 2020 to January 2021 in Afzalipour Hospital in Kerman, Iran. The outcome of each patient's hospitalization was recorded as discharge or death. A filtering technique was used to score the features, and well-known ML algorithms were applied to predict the risk of mortality and LoS of patients. Ensemble learning methods were also used. To evaluate the performance of the models, different measures including F1, precision, recall, and accuracy were calculated. Transparent reporting was assessed against the TRIPOD guideline.
Results
This study was performed on 1291 patients, including 900 alive and 391 dead patients. Shortness of breath (53.6%), fever (30.1%), and cough (25.3%) were the three most common symptoms in patients. Diabetes mellitus (DM) (31.3%), hypertension (HTN) (27.3%), and ischemic heart disease (IHD) (14.2%) were the three most common chronic comorbidities of patients. Twenty-six important factors were extracted from each patient's record. The gradient boosting model, with 84.15% accuracy, was the best model for predicting mortality risk, and the multilayer perceptron (MLP) with the rectified linear unit function (MSE = 38.96) was the best model for predicting the LoS. The most important factors in predicting the risk of mortality were hyperlipidemia, diabetes, asthma, and cancer, and the most important factor in predicting LoS was shortness of breath.
Conclusion
The results of this study showed that ML algorithms can be a good tool to predict the risk of mortality and LoS of patients with COVID-19 and chronic comorbidities based on physiological conditions, symptoms, and demographic information of patients. The gradient boosting and MLP algorithms can quickly identify patients at risk of death or long-term hospitalization and notify physicians so that they can undertake appropriate interventions.
Introduction
In December 2019, coronavirus disease (COVID-19) emerged in China and spread rapidly around the world. 1 This virus is highly contagious in humans. 2 In January 2020, the World Health Organization declared the outbreak of COVID-19 a Public Health Emergency of International Concern. 3 By October 2020, nearly 40 million people in more than 180 countries had been infected and more than one million had died. 4 Data from the early months of the outbreak showed that COVID-19 is more common in patients with chronic comorbidities such as cardiovascular disease, kidney disease, type 2 diabetes, hypertension, and cancer.5,6 A retrospective study in China also showed that out of 138 patients with COVID-19, 64 (46.4%) had one or more chronic comorbidities. 7 Evidence shows that the outbreak of the COVID-19 virus has been a serious threat to patients with chronic comorbidities because these patients have a more severe form of respiratory problems than other COVID-19 patients, which can increase their length of stay (LoS) in hospital or even their death rate.8,9
Numerous studies have used machine learning (ML) algorithms to predict survival and calculate the LoS of patients with chronic comorbidities.10–12 Survival can be defined as the time interval between the diagnosis of the disease and the death of the patient. 13 Factors such as chronic comorbidities and pandemics can reduce patient survival. The LoS is one of the criteria for measuring the utilization of a hospital 14 because the reduction of patients’ LoS will lead to the optimal use of medical resources available in the hospital, such as hospital beds, staff, etc. 15
So far, ML, as a subset of artificial intelligence techniques, has been successful in predicting the rapid recovery of many chronic comorbidities such as diabetes16,17 and cancer 18 and in reducing the LoS 19 and mortality from cardiovascular disease. 20 Accurate prediction of mortality risk and reduction of the LoS of patients reduce the pressure on healthcare systems and support medical decisions. Recently, several studies have used ML algorithms to predict the risk of mortality21–24 and calculate the LoS12,25–27 in patients with COVID-19. Wang et al. 28 used two ML models based on clinical and laboratory features to predict the mortality risk of COVID-19 patients. In South Korea, a study demonstrated that Lasso and linear support vector machine (SVM) models had high sensitivities and specificities in predicting the mortality risk of COVID-19 patients. 29 In addition, Jimenez-Solem et al. 30 used data from patients with COVID-19 in Denmark and the United Kingdom to develop mortality prediction models for COVID-19 patients. To predict the LoS, Ebinger et al. 12 trained three ML models with an accuracy of 0.765, which were based on the analysis of electronic health records of 966 COVID-19 patients in a large educational and medical center in the United States. In Saudi Arabia, a study predicted the LoS of COVID-19 patients in the intensive care unit (ICU) with the highest accuracy (94.16%) using the Random Forest model. 27 A study in Iran showed that among the seven ML techniques, the SVM algorithm with an average accuracy of 99.5%, average specificity of 99.7%, and average sensitivity of 99.4% had the best performance on the laboratory data of 1225 COVID-19 patients. 31
A systematic study reported that variables such as age, gender, and chronic comorbidities including hypertension and diabetes played an important role in increasing the risk of death and LoS of patients with COVID-19. 32 Also, several studies have shown a higher risk of death and longer LoS in COVID-19 patients with a history of chronic comorbidities such as hypertension, diabetes, and acute respiratory distress syndrome than in others.33–40 Although these studies have yielded interesting results, some of them used standard biostatistics methods for their calculations, leaving room for ML approaches.34–36 In addition, some studies included only a specific type of inpatient, such as patients in the ICU.37–39 The objective of this study was to predict the risk of mortality and LoS of COVID-19 patients with a history of any chronic comorbidities by comparing the performance of selected ML algorithms and identifying the most important clinical variables. In general, we seek to answer the following questions concerning COVID-19 in patients with a history of any chronic comorbidities:
What are the best ML algorithms for predicting their risk of mortality?
What are the best ML algorithms for predicting their LoS?
What are the most important clinical variables for predicting their risk of mortality?
What are the most important clinical variables for predicting their LoS?
We present this article in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) reporting checklist (Supplemental Appendix 1).
Methods
This retrospective study was conducted on all patients with COVID-19 who visited Afzalipour Hospital (the main treatment center for COVID-19 patients in Kerman, the largest province in Iran) from March 2020 to January 2021. Only patients whose real-time polymerase chain reaction test was positive for COVID-19 acute respiratory syndrome and who had at least one chronic comorbidity were included in this study. Patients under 18 years of age were excluded from the study because they are admitted to a children's hospital. Pregnant women were also excluded because they fall within the scope of pregnancy-specific research. The names of the variables were extracted from the patient medical records.
Based on the review of various studies32,33,41,42 and the approval of two infectious disease specialists, factors essential for predicting mortality and LoS of patients with COVID-19 and chronic comorbidities were identified. Data for all patients were extracted from the hospital information system based on their electronic health records (EHRs). Since not all the data were recorded in patients’ EHRs, the rest of the required data were collected by reviewing paper records. Demographic information, chronic comorbidities, patients’ symptoms at admission, as well as their discharge status (alive/dead) and LoS were collected. The data were entered into an Excel sheet. Two output variables were considered. The first variable indicates the patient's health status at discharge. The values for this variable were 0 for dead and 1 for alive status. Patients who were discharged in stable condition and without any symptoms were regarded as improved patients. Mortality after discharge from the hospital was not considered. The second variable was the LoS of patients in the hospital. Figure 1 shows the flowchart of the patient selection stage. Out of 1538 patients with COVID-19, 1291 were included in our analysis.

Flowchart for selecting patients to participate in the study. ROC: receiver operating characteristic.
In this study, only clinical information recorded in the patient medical records was used and the identity information of patients was not used. Therefore, the requirement for informed consent was waived.
Data preparation
For the data preprocessing stage, missing data were first identified in Excel and deleted. For data normalization, the Standard Scaler method was used. In this method, u represents the mean and s represents the standard deviation of a feature. As a result, the standard value z of sample x is obtained from the following equation:
$$z = \frac{x - u}{s}$$
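The standardization step, z = (x - u)/s, can be illustrated with a minimal sketch in plain Python. The function name `standardize` and the sample ages are hypothetical and are not taken from the study data; the population standard deviation is assumed, which is the convention scikit-learn's `StandardScaler` follows.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores z = (x - u) / s for a list of numbers."""
    u = fmean(values)    # mean of the feature
    s = pstdev(values)   # population standard deviation (divides by n)
    if s == 0:
        # A constant feature has no spread; map every value to zero.
        return [0.0 for _ in values]
    return [(x - u) / s for x in values]


ages = [40, 50, 60, 70, 80]   # hypothetical ages, not study data
z = standardize(ages)
```

After this transformation each feature has mean 0 and unit standard deviation, so features measured on different scales contribute comparably to distance-based models such as KNN and SVM.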
Predictive analytics algorithms
Various studies were reviewed to select the best ML algorithms to predict mortality and LoS of COVID-19 patients with a history of chronic comorbidities.25,39,43 Since the performance of each ML algorithm depends on data structure and type of work, 44 the performances of several algorithms were evaluated using different criteria. In this study, Random Forest, multilayer perceptron (MLP), K-Nearest Neighbor (KNN), AdaBoost, Naïve Bayes, and SVM algorithms were used for the prediction of mortality. MLP, ElasticNet, support vector regression (SVR), Lasso, and Ridge algorithms were used for the prediction of LoS.
We configured the random forest algorithm 45 with 10, 50, and 100 trees in the forest. MLP was used to create a neural network model. The factors effective in predicting the outcomes were the inputs of the neural network (n = 31), and its output was the target variable (mortality or LoS). For KNN, 1, 3, 5, 10, 15, 30, and 50 neighbors were used. For the random forest analysis, bagging with 100 iterations and a base learner was used.
We also used ensemble learning methods. Ensemble learning is the process in which two or more ML models are combined to get better results and improve robustness over a single estimator. There are different approaches to ensembling. These methods decrease the variance of a base estimator and minimize the overfitting of data. 46
In this study, we calculated the overall accuracy to compare ML algorithms. In addition, a receiver operating characteristic (ROC) curve was generated for each prediction algorithm, and its area under the curve (AUC) and confusion matrix were calculated. ROC is often used to determine the strength of a model. In medicine, ROC is used to evaluate the accuracy of diagnostic tests. An AUC of 0.5 or below indicates classification no better than random, and an AUC between 0.5 and 1 indicates the overall discriminative ability of the model. 47 We also ensured that there was no overlap between the training and test datasets at any level.
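To make the AUC interpretation concrete, the following is a minimal sketch that computes AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function name `roc_auc` and the labels and scores are hypothetical illustrations, not study data, and this pairwise formulation is only one of several equivalent ways to obtain the area.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as one half.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = positive class) and predicted scores.
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(y_true, y_score)
```

A model that scores every case identically obtains 0.5 under this definition, which matches the chance-level reading of the ROC curve.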
Statistical analysis and performance evaluation
The development of a model for predicting mortality and LoS for COVID-19 patients with chronic comorbidities was performed using the Scikit-learn package in Python (version 3.8). For performance evaluation, 70% of the data were considered for training and 30% for testing. The efficiency of patient mortality risk prediction models was evaluated by calculating AUC, ROC, precision, specificity, accuracy, F1 score, and recall. These criteria are defined and calculated using the confusion matrix components (Table 1).
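A 70/30 hold-out split can be sketched as follows. This is an illustration in plain Python, not the authors' code: the function name, the shuffling, and the fixed seed are assumptions, and library routines may round the split sizes differently.

```python
import random


def train_test_split_70_30(records, seed=0):
    """Shuffle the records and return (train, test) with a 70/30 split."""
    rng = random.Random(seed)    # fixed seed so the split is reproducible
    shuffled = list(records)     # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


records = list(range(1291))      # stand-in record IDs; 1291 is the cohort size
train, test = train_test_split_70_30(records)
```

Because each record lands in exactly one of the two lists, the training and test sets cannot overlap.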
Confusion matrix.
FN: dead people incorrectly identified as alive; FP: alive people incorrectly identified as dead; TN: alive people correctly identified as alive; TP: dead people correctly identified as dead.
Assuming that the numbers of positive examples (dead patients) and negative examples (alive patients) are P and N, respectively, the following definitions can be given:
FP = alive people incorrectly identified as dead
TP = dead people correctly identified as dead
TN = alive people correctly identified as alive
FN = dead people incorrectly identified as alive
Accuracy refers to the proportion of all patients, alive and dead, who were correctly identified as alive or dead, that is, (TP + TN)/(TP + TN + FP + FN). 48 Precision refers to the proportion of patients identified by the model as dead who actually died, TP/(TP + FP). Sensitivity (recall) refers to the proportion of patients who died whom the model correctly identified as dead, TP/(TP + FN). Thus, higher sensitivity indicates a more accurate identification of dead patients. Specificity refers to the proportion of people who are alive whom the model correctly identifies as alive, TN/(TN + FP). 49 The F1 score is the harmonic mean of precision and recall.
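The measures above can be computed directly from the four confusion matrix counts. The sketch below assumes that "dead" is the positive class, in line with the TP and FN definitions; the function name and the counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from confusion-matrix counts.

    The positive class is 'dead'; zero denominators yield 0.0.
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not results from the study).
m = classification_metrics(tp=80, fp=20, tn=250, fn=37)
```

With imbalanced classes, as in this cohort, accuracy alone can look favorable even when sensitivity is low, which is one reason to report the measures together.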
Mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE) were used to compare the models for predicting the LoS of patients in the hospital.
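For reference, the three regression errors can be computed as in the minimal sketch below. The function name and the LoS values are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt


def regression_errors(y_true, y_pred):
    """Return MSE, RMSE, and MAE for paired true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / len(residuals)
    mae = sum(abs(r) for r in residuals) / len(residuals)
    return {"mse": mse, "rmse": sqrt(mse), "mae": mae}


# Hypothetical lengths of stay in days (not study data).
los_true = [3, 7, 10, 5]
los_pred = [4, 6, 8, 5]
errs = regression_errors(los_true, los_pred)
```

MSE and RMSE penalize large errors more heavily than MAE, and RMSE and MAE are expressed in the same unit as the target, so the three values are not directly interchangeable when comparing studies.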
Ethical considerations
This study was approved by the ethics committee of Kerman University of Medical Sciences (code of ethics: IR.KMU.REC.1400.055). To protect the privacy and confidentiality of the patients, we concealed the unique identifying information of all the patients in the data collection process.
Results
Patients’ characteristics
The dataset contained 1291 patient records including 900 (69.7%) alive and 391 (30.3%) dead patients (Figure 1). The mean ages of dead and alive patients were 66.2 and 53.9 years, respectively. Also, 54.6% of patients were male.
Shortness of breath (53.6%), fever (30.1%), and cough (25.3%) were the three most common symptoms in patients. Diabetes mellitus (31.3%), hypertension (27.3%), and ischemic heart disease (IHD) (14.2%) were the three most common chronic comorbidities of patients.
In total, based on the studies and experts’ opinions, 26 characteristics were specified to predict the risk of mortality and LoS of COVID-19 patients with a history of chronic comorbidities (Table 2). The degree of importance of each feature in predicting mortality risk and LoS is shown in Table 2. The most effective chronic comorbidity for predicting mortality was hypertension, and for predicting LoS it was asthma. Shortness of breath and sore throat were the most important symptoms for predicting mortality and LoS, respectively.
Risk factors of mortality and LoS of COVID-19 patients with chronic disease.
CKD: chronic kidney disease; COPD: chronic obstructive pulmonary disease; DM: diabetes mellitus; ESRD: end stage renal disease; HLP: hyperlipidemia; IHD: ischemic heart disease; LoS: length of stay; TSH: hypothyroidism.
Prediction of mortality requires classification methods, whereas prediction of LoS is a regression problem. There are several reasons why ML models can predict better than traditional methods. First, ML models can make predictions based on a much larger data set than traditional methods. Second, ML is not biased by human emotions or subjective opinions. Furthermore, ML models can adapt to changes quickly. Finally, they can identify patterns that are too complex for traditional methods to capture.
Because the proportion of null values in the features did not exceed a set threshold (5% of the total data), we adopted RandomOverSampler to balance the data. It over-samples the minority class(es) by picking samples at random with replacement.
To predict the mortality risk of COVID-19 patients with a history of chronic comorbidities, Naïve Bayes and five ML algorithms were implemented. A comparison of their results in terms of specificity, sensitivity, accuracy, and ROC curve is shown in Table 3. The results of this table show that SVM was the best model for predicting mortality risk based on most evaluation metrics. The Naïve Bayes model was the weakest model based on the F1-score metric.
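The idea behind random over-sampling can be sketched in plain Python as follows. This is a simplified illustration of the technique, not the RandomOverSampler implementation from the imbalanced-learn package; the function name, the seed, and the toy rows are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random, with replacement, until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))   # sampling with replacement
            y_out.append(label)
    return X_out, y_out


# Toy data: one feature per row, label 0 is the minority class.
X = [[66], [70], [54], [48], [59], [61]]
y = [0, 0, 1, 1, 1, 1]
X_bal, y_bal = random_oversample(X, y)
```

Over-sampling is commonly applied to the training split only, so that duplicated rows cannot appear in both the training and test sets; the text does not state the order in which the two steps were performed here.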
The performance of Naïve Bayes and five machine learning models in mortality prediction.
*In each column, the best result is shown in bold.
MLP: multilayer perceptron; ROC: receiver operating characteristic; SVM: support vector machine.
Table 4 shows the predictions based on ensemble learning methods. Gradient boosting achieved the best results compared to the other methods.
The performance of ensemble learning methods.
*In each column, the best result is shown in bold. ROC: receiver operating characteristic.
The performance of the selected models based on the ROC curve and confusion matrix is shown in Figures 2 to 5. The possible activation functions that can be used in neural networks are the sigmoid, tanh, and rectified linear unit (ReLU) functions. According to the results of Table 3, the average accuracy was about 74.11%. This means that the prediction for about 74 out of every 100 data items given to the network is correct.

Performance of mortality risk prediction models based on ROC curve. AUC: area under curve; ROC: receiver operating characteristic.

Performance of the best mortality risk prediction model (SVM) based on the confusion matrix. SVM: support vector machines.

Performance of the best mortality risk prediction ensembling model (Gradient Boosting) based on the confusion matrix.

Performance of mortality risk prediction ensembling models based on ROC curve. AUC: area under curve; ROC: receiver operating characteristic.
For the MLP method, instead of using a pretrained model, we built our model and trained it from scratch with our data. We adopted 100, 10, and 0.5 for epoch number, batch size, and dropout, respectively. We used the EarlyStopping function in the Keras library, which monitors the accuracy and loss values. If the loss is being monitored, training halts when an increase in the loss values is observed.
Model explanation is also necessary for understanding aggregated ML models. We adopted local interpretable model-agnostic explanations (LIME) for this purpose to illustrate explanations for any single patient. For example, the model explanation for patient #1 is depicted in Figure 6. The colors show the associations between features and the prediction. The colors blue and orange
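The stopping rule described above can be illustrated with a simplified sketch of patience-based early stopping. It is not the Keras implementation; the function name, the `patience` and `min_delta` parameters as defined here, and the loss values are assumptions made for illustration.

```python
def early_stopping_epoch(val_losses, patience=1, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops after `patience` consecutive epochs in which the loss
    fails to improve on the best value seen so far by more than
    `min_delta`. Returns None if that never happens.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss      # new best loss: reset the counter
            wait = 0
        else:
            wait += 1        # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None


# Hypothetical loss values per epoch (not from the study).
losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.49]
stop_at = early_stopping_epoch(losses)
```

A larger patience tolerates short-lived increases in the loss, at the cost of training for more epochs before stopping.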

LIME feature plot states the effect of each variable on the classification.
The results of comparing the lowest error rate of ML algorithms in predicting the LoS of COVID-19 patients with comorbidities are shown in Table 5. According to the results of this table, MLP (32*1024*32) with the ReLU activation function is the best model for predicting the LoS of patients based on the considered metrics.
The performance of various machine learning models for predicting the length of stay.
*In each column, the best result is shown in bold. MAE: mean absolute error; MLP: multilayer perceptron; MSE: mean square error; ReLU: rectified linear unit; RMSE: root mean squared error; SVR: support vector regression.
Discussion
In the event of an outbreak, the prediction of mortality and LoS of patients is essential for resource management in healthcare facilities. In the present study, several ML algorithms were developed to predict the risk of mortality and LoS using demographic indicators, clinical symptoms, and chronic comorbidities of patients with COVID-19. This study was conducted on 900 alive and 391 dead patients. Based on the results, SVM and MLP algorithms with ReLU activation function had the best performance in predicting mortality risk and LoS of COVID-19 patients with chronic comorbidities, respectively.
Numerous studies have so far developed mortality prediction models for patients with COVID-19,29,50,51 but they did not focus specifically on patients with comorbidities. Among the six ML algorithms used in this study, SVM with an accuracy of 80% and ROC of 0.85 performed better than other models in predicting the risk of mortality in patients. These findings were consistent with the results of other studies. For example, in a study by Agieb et al., 52 SVM was the most successful model in predicting mortality. Similarly, another study by Booth et al. 53 reported that SVM is the best model with 91% sensitivity and specificity.
Numerous studies have also developed models for predicting the LoS of patients with COVID-19,38,54,55 but they either did not include patients with chronic comorbidities or targeted a specific type of disease. In the present study, among the five ML algorithms used to predict the LoS of COVID-19 patients with chronic comorbidities, MLP with the ReLU activation function had the best performance (MAE = 0.434, RMSE = 0.624, and MSE = 0.389). These findings confirmed the results of other studies. For example, Bacchi et al. 56 reported that MLP is the best model for predicting the LoS of COVID-19 patients with the highest accuracy (MAE = 0.246, RMSE = 0.369, and AUC = 0.864). In another similar study by Kulkarni et al., 57 an MLP-based model predicted the LoS of COVID-19 patients with 90.87% accuracy.
On the other hand, several studies have investigated the role of chronic comorbidities in predicting outcomes of COVID-19.58,59 Diabetes, 60 asthma, 61 cancer, 62 hypertension, and cardiovascular diseases63,64 had a significant predictive role among chronic comorbidities. However, in the present study, Hyperlipidemia (HLP) was the most effective chronic comorbidity for predicting mortality and LoS. After HLP, other chronic comorbidities such as diabetes, asthma, and cancer played a significant role in predicting patient mortality.
Clinical symptoms play a major role in the development of complications associated with COVID-19. In the present study, judging by the importance of the ranked features, shortness of breath was the most important symptom for predicting the mortality and LoS of COVID-19 patients with chronic comorbidities in the hospital. In addition to shortness of breath, symptoms such as sore throat, fever, diarrhea, and chest pain could effectively predict mortality, and symptoms such as fever, cough, fatigue, and abdominal pain were effective in predicting LoS. The best clinical symptoms in predicting mortality and longer LoS in other studies were fever, cough, shortness of breath, and diarrhea.55,64–68
This study showed that the patient’s age effectively increases mortality and LoS. A study in the United Kingdom on 800 patients with COVID-19 and cancer showed that mortality in older patients is significantly higher. 69 In Iran, a study of 459 COVID-19 patients admitted to hospitals showed that the number of deaths increases with the age of patients. 68 In previous studies, age has been considered an independent and significant mortality index in diseases such as Middle East respiratory syndrome and severe acute respiratory syndrome (SARS).70,71
Finally, ML algorithms can be useful for physicians and administrators involved in treating patients with COVID-19 as well as COVID-19 patients with chronic comorbidities. The proposed algorithms can predict the mortality and LoS of patients with optimal accuracy, precision, sensitivity, specificity, and ROC. The results of these predictions can lead to the optimal use of hospital resources in treating patients with more critical conditions, help to provide better care, and reduce medical errors resulting from fatigue and long working hours in hospital wards. Designing credible predictive models may improve the quality of care, increase patient survival, and reduce LoS. Therefore, predictive models for analyzing the risk of mortality and LoS can help identify high-risk patients and adopt the most effective care and treatment plans.
Limitation
This study had two limitations. First, conducting this retrospective study in a single center may affect the quality of the data and the generalizability of the results. However, this hospital was the largest COVID-19 center in Kerman province and many patients from all over the province were hospitalized and treated there. Second, we did not include important prognostic factors such as laboratory and radiological biomarkers.72–74 However, according to the aim of the present study, it was sufficient to consider only the usual clinical features of patients at the time of admission.
Conclusion
The results showed that the ML algorithms developed in this study can predict the risk of mortality and LoS in COVID-19 patients with chronic comorbidities with a mean accuracy of 74% and 85%, respectively, based on their physiological conditions, symptoms, and demographic information. Older age plays an important role in increasing mortality and LoS in these patients. Due to the high mortality rate of COVID-19 patients with chronic comorbidities, we recommend that future studies monitor the course of the disease in patients with chronic comorbidities who have survived COVID-19.
Supplemental Material
Supplemental material, sj-docx-1-dhj-10.1177_20552076231170493 for Prediction of mortality risk and duration of hospitalization of COVID-19 patients with chronic comorbidities based on machine learning algorithms by Parastoo Amiri, Mahdieh Montazeri, Fahimeh Ghasemian, Fatemeh Asadi, Saeed Niksaz, Farhad Sarafzadeh and Reza Khajouei in DIGITAL HEALTH
Footnotes
Availability of data and material
Our data or material may be available from the corresponding author or first author upon reasonable request.
Authors’ contributions
PA, FGh, and MM contributed to the study design; PA and FA collected the data; MM and SN analyzed the data; PA and MM drafted the manuscript; FA, RKh, and FS critically revised the manuscript for important intellectual content. All authors took part in the entire study and approved the final manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethics approval and consent to participate
This article was extracted from an independent research project performed in the field of medical informatics at Kerman University of Medical Sciences without organizational support. This study was approved by the ethics committee of Kerman University of Medical Sciences (code of ethics: IR.KMU.REC.1400.055) and was performed according to the ethical guidelines of the Helsinki Declaration. Also, this study was supported by the Student Research Committee of Kerman University of Medical Sciences (code: 99000625). In addition, due to the retrospective nature of the study, the ethics committee of Kerman University of Medical Sciences waived the need for written informed consent.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References
