Abstract
BACKGROUND:
Coronavirus disease 2019 (COVID-19) is a deadly viral infection spreading rapidly around the world since its outbreak in 2019. In the worst case a patient’s organ may fail leading to death. Therefore, early diagnosis is crucial to provide patients with adequate and effective treatment.
OBJECTIVE:
This paper aims to build machine learning prediction models to automatically diagnose COVID-19 severity with clinical and computed tomography (CT) radiomics features.
METHOD:
SP-V-Net was used to segment the lung parenchyma, and CT radiomics features were then extracted from the segmented lung parenchyma regions. Over-sampling, under-sampling, and a combination of over- and under-sampling methods were used to address the data imbalance problem. Random forest was used to select the optimal number of features. Eight different machine learning classification algorithms were used to analyze the data.
RESULTS:
The experimental results showed that the COVID-19 mild-severe prediction model trained with both clinical and CT radiomics features had the best prediction results. The GBDT classifier achieved an accuracy of 0.931, a ROC AUC of 0.942, and an AUCPRC of 0.694, outperforming the other classifiers.
CONCLUSION:
This study can help clinicians identify patients at risk of severe COVID-19 deterioration early on and provide some treatment for these patients as soon as possible. It can also assist physicians in prognostic efficacy assessment and decision making.
Introduction
Coronavirus disease 2019 (COVID-19) is a deadly viral infection that has spread rapidly around the world since its outbreak in 2019. According to a report from the World Health Organization, as of 25 May 2022, there have been more than 524 million confirmed cases of COVID-19, including more than 6.28 million deaths [1]. The severity of COVID-19 can be classified as mild, ordinary, severe, or critical [2]. Patients with severe COVID-19 may suffer from massive alveolar damage and respiratory failure, leading to death [3]. Therefore, early classification of COVID-19 severity and effective targeted treatment for critically ill patients can reduce the risk of complications. Early and automatic diagnosis will help countries all over the world provide timely treatment and quarantine. Hospitals can also offer more professional treatment for severe COVID-19 patients.
Nucleic acid screening, clinical features, epidemiological features, and imaging findings form the basis for diagnosing COVID-19 [4]. After comparing the diagnostic performance of RT-PCR tests and chest computed tomography (CT) for the initial negative-to-positive diagnosis, Ai et al. [5] concluded that chest CT detection is faster. Chest CT is recommended as a routine test for surveillance and diagnosis of COVID-19 because imaging features such as ground-glass opacity and consolidation on chest CT can be used to identify SARS-CoV-2 infection-associated pneumonia [6]. Chest CT also assists physicians in identifying the early stages of lung infection [8] and helps governments establish stronger public health surveillance and response systems [10]. A combined assessment using clinical records and imaging features allows for a more accurate early diagnosis of patients with COVID-19.
Although some studies have constructed machine learning (ML) prediction models, to our knowledge, the current diagnosis of COVID-19 severity does not yet achieve satisfactory accuracy. Terwangne et al. [11] used data from 295 RT-PCR-positive COVID-19 patients to develop a Bayesian network model for predicting the severity grading of COVID-19 patients and obtained an AUC of 83.8%. Yao et al. [12] developed a model to predict the severity of COVID-19 based on the SVM algorithm using data from 137 COVID-19 patients (75 severe, 62 mild) and obtained an accuracy of 81.48%.
Liang et al. [13] developed a model for predicting the severity of COVID-19 patients based on the SVM algorithm using data from 172 patients (60 severe) and finally obtained an average accuracy of 91.83%. Liang et al. [14] used data from 1590 (131 severe) patients to develop a model based on the LR algorithm for predicting clinical risk scores for the occurrence of critical illness in hospitalized patients with COVID-19, ultimately obtaining an AUC of 88% on the validation set.
Zhu et al. [15] used data from 127 patients (16 severe) to develop a model based on the LR algorithm for assessing the severity of infection in COVID-19 patients, ultimately obtaining an AUC of 90.0%.
The above studies related to the severity classification of COVID-19 were based on clinical features only. They did not consider the influence of CT image feature factors on the classification of mild to severe disease. Because of this, we use clinical examination data and CT radiomics features to build a prediction model for mild and severe COVID-19 patients. The study aims to segment the region of interest in the lung, extract CT radiomics features based on the segmentation results, and train a machine learning prediction model using the clinical records and the extracted image features. In the following sections, data collection and modeling methods will be described first, followed by developing COVID-19 patient severity detection models using traditional and ensemble machine learning algorithms.
Materials and methods
The clinical and chest CT data used for this retrospective study were collected by the Shanghai Public Health Clinical Center from 20 January to 29 May 2020. The Ethics Committee of the Shanghai Public Health Clinical Center approved this retrospective study. Figure 1 shows the overall flowchart we developed for detecting the severity of COVID-19. First, SP-V-Net was used to segment the CT images to obtain the lung contours. Radiomics was then used to extract the lung image features, which were combined with clinical features to build a mild-severe diagnostic model after feature screening.
The 427 clinical data collected were used for this study, including 387 patients with mild COVID-19 (mean age, 40.26
Baseline characteristics of COVID-19 patients with mild and severe disease
Flowchart of our approach to building the COVID-19 severity diagnosis model.
On inspecting the clinical dataset, we found that some clinical features had missing data; the number and proportion of missing values for each feature are shown in Table 2. To reduce the impact of missing data on the experimental results, we used the Multivariate Imputation by Chained Equations (MICE) [16] method to impute the missing values. MICE is a multiple imputation method that works iteratively and accounts for the uncertainty of missing values by creating multiple imputations. An imputation model is specified for each variable, so the data are imputed variable by variable. We selected MICE to treat the missing values because of its robustness and its ability to account for uncertainty.
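The chained-equations idea can be illustrated with a deliberately reduced sketch. It is not the procedure used in the study: it handles only two variables, uses a simple least-squares line as the per-variable model, and produces a single deterministic imputation, whereas MICE proper draws random values and creates several imputed datasets. The function names and the toy table are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x for a single predictor."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = cov / var if var else 0.0
    return my - slope * mx, slope

def chained_impute(rows, n_iter=5):
    """Fill None entries in a two-column table by alternating regressions."""
    missing = [[v is None for v in row] for row in rows]
    # Step 1: initialise every missing cell with the mean of its column.
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in (0, 1)]
    data = [[col_means[j] if r[j] is None else float(r[j]) for j in (0, 1)]
            for r in rows]
    # Step 2: cycle over the columns and re-predict each column's missing cells.
    for _ in range(n_iter):
        for j in (0, 1):
            k = 1 - j  # the other column acts as the predictor
            observed = [i for i in range(len(data)) if not missing[i][j]]
            if len(observed) == len(data):
                continue  # nothing to impute in this column
            a, b = fit_line([data[i][k] for i in observed],
                            [data[i][j] for i in observed])
            for i in range(len(data)):
                if missing[i][j]:
                    data[i][j] = a + b * data[i][k]
    return data

# Hypothetical toy table: the second value of the third row is missing.
toy = [[1, 2], [2, 4], [3, None], [4, 8]]
filled = chained_impute(toy)
```

In this toy table the second column is exactly twice the first, so the regression recovers a value of 6 for the missing cell instead of the column mean used for initialisation.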
Features and missing values in the dataset
SP-V-Net [17] is a lung parenchyma segmentation model based on image deformation. The advantage of this model is that it uses 3D V-Net for end-to-end lung extraction and combines the spatial transform network (STN) module with prior shape knowledge to refine the V-Net output, so that the final segmentation results are closer to the ground-truth label. First, the threshold segmentation results were used as the lung lobe shape prior and combined with the gold standard data to train the SP-V-Net segmentation model. Second, the lung lobes of each patient were segmented automatically by SP-V-Net, and experienced operators confirmed the CT image segmentation results. Finally, we multiplied the original image by the binary segmentation mask to obtain the CT lung ROIs of all 427 patients. Radiomics was used to extract the CT radiomics features from the lung ROIs for machine learning; the 120 features presented by Zwanenburg et al. [18] were measured. All of these features, which may be related to COVID-19 classification, were extracted for our analysis.
Feature selection
Feature selection effectively reduces the number of features and in many cases also helps to improve accuracy [19]. It can remove unrelated features [20], which usually enhances the model’s generalization performance. We used random forest to rank the features and selected the top-ranked ones. Because the Random Forest algorithm is stochastic, we trained the model several times, selected a fixed number of top-ranked features each time, and used the intersection of the selections across runs as the final feature set.
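The intersection step can be sketched in a few lines. This is a minimal illustration, assuming each random forest run yields a mapping from feature name to importance score; the scores, the `glcm_contrast` feature name, and the value of `k` are hypothetical and do not come from the study.

```python
def top_k(importances, k):
    """Return the names of the k features with the highest importance."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return set(ranked[:k])

def stable_features(runs, k):
    """Keep only the features that rank in the top k in every run."""
    selected = top_k(runs[0], k)
    for run in runs[1:]:
        selected &= top_k(run, k)
    return selected

# Hypothetical importance scores from three random forest fits.
runs = [
    {"age": 0.30, "CRP": 0.25, "LDH": 0.20, "PCT": 0.15, "glcm_contrast": 0.10},
    {"age": 0.28, "LDH": 0.27, "CRP": 0.18, "glcm_contrast": 0.17, "PCT": 0.10},
    {"CRP": 0.31, "age": 0.29, "PCT": 0.20, "LDH": 0.12, "glcm_contrast": 0.08},
]
selected = stable_features(runs, k=3)
```

Taking the intersection is conservative: a feature that drops out of the top `k` in a single run is discarded, so the selected set can be smaller than `k`.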
Model selection
After comparing different classes of machine learning classifiers, we trained the following classifier models to predict the severity of COVID-19.
AdaBoost (Adaptive Boosting) [21] is an iterative algorithm that trains different weak classifiers on the same training set by increasing the weights of misclassified data and decreasing the weights of correctly classified data. Finally, AdaBoost combines these weak classifiers linearly to form a strong classifier. GBDT (Gradient Boosted Decision Tree) [22] is an ensemble algorithm that produces a weak classifier in each of multiple iterations. The overall classifier is obtained by weighting and aggregating the weak classifiers, which improves prediction accuracy. MLP (Multilayer Perceptron) [23] is a feed-forward ANN containing at least three layers of neurons, trained with supervised back-propagation, which can classify data that are not linearly separable. XGBoost (Extreme Gradient Boosting) [24] is a scalable tree-boosting machine learning system that can solve real-world scale problems using minimal resources. KNN (K-nearest neighbors) [25] is a simple and effective classification algorithm that classifies a sample by measuring the distance between feature values. LR (Logistic Regression) [26] is a classical classification method in supervised learning and is often used for regression problems in which the dependent variable is categorical. Logistic regression is often used in medical research to analyze risk factors for a particular disease. NB (Naive Bayes) [27] is one of the most effective inductive learning algorithms in data mining and machine learning, and it performs surprisingly well in classification. RF (Random Forest) [28] is an ensemble algorithm that obtains the final prediction by voting across decision trees, which can mitigate the data imbalance problem, and it can be used for feature selection because it provides the relative importance of different features in the classification process.
Data sampling methods
Data imbalance is one of the current challenges in data analysis, which usually leads to over-fitting models. To further describe the data imbalance, we represent the minority class sample by using
When
There was also a data imbalance in this study, with 387 cases of mild disease and 40 cases of severe disease in the data we collected, with a data imbalance ratio of 9.675. To further address the effect of data imbalance on the experimental results, we sampled the data in three different ways (Under-sampling, Over-sampling, and Combination of over- and under-sampling methods) for all the data separately.
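The reported ratio follows directly from the class counts. In the expression below, the symbols IR, N_maj, and N_min are assumed names chosen for illustration, since the study's own notation is not recoverable from the text above.

```latex
\mathrm{IR} = \frac{N_{\mathrm{maj}}}{N_{\mathrm{min}}} = \frac{387}{40} = 9.675
```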
ClusterCentroids [29] uses KMeans to cluster each class of samples separately and replaces each cluster of samples with its centroid. RandUnder (Random Under-Sampling) [30] randomly selects samples from the majority class for removal. NearMiss [31] selects the most representative samples from the majority class for training to alleviate the information loss of random under-sampling. TomekLink [32] identifies pairs of samples that are nearest neighbors of each other but belong to different classes, and removes the majority-class sample of each pair. ENN (Edited Nearest Neighbor) [33] traverses the majority-class samples and deletes a sample if most of its k nearest neighbors belong to a different class. RENN (Repeated Edited Nearest Neighbor) repeats the deletion process of ENN until no more samples can be deleted. CNN (Condensed Nearest Neighbor) [34] uses the nearest neighbor approach iteratively to determine whether a sample should be retained or rejected. OSS (One Side Sampling) [35] rejects noisy samples by using multiple TomekLink iterations. AllKNN [36] applies ENN multiple times, changing the number of nearest neighbors each time.
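The simplest of these methods, random under-sampling, can be sketched as follows. This is an illustration only: the `ratio` parameter is interpreted here as the fraction of majority-class samples to keep, which is an assumption about how a sampling ratio might be applied, and the index lists merely reuse the training-set class sizes (258 mild, 27 severe) as placeholders.

```python
import random

def random_under_sample(majority, minority, ratio, seed=0):
    """Randomly keep a fraction `ratio` of the majority class; keep all minority."""
    rng = random.Random(seed)
    n_keep = max(1, round(ratio * len(majority)))
    # sample() draws without replacement, so no majority sample is duplicated.
    return rng.sample(majority, n_keep), list(minority)

# Placeholder sample indices: 258 majority (mild) and 27 minority (severe).
mild = list(range(258))
severe = list(range(258, 285))
kept_mild, kept_severe = random_under_sample(mild, severe, ratio=0.2)
```

The minority class is left untouched, so lowering `ratio` moves the two classes closer to balance at the cost of discarding majority-class information.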
RandOver (Random Over-Sampling) randomly draws samples from the minority class and adds the drawn samples to the data set. SMOTE (Synthetic Minority Oversampling Technique) [37] interpolates between minority-class samples to generate additional samples. BorderSMOTE (Borderline Synthetic Minority Oversampling Technique) [38] first identifies the minority-class samples located at the class border and then performs KNN-based sampling for these samples. KMeansSMOTE [39] first applies KMeans clustering and then oversamples using SMOTE. SVMSMOTE [40] uses an SVM classifier to find support vectors, around which new minority-class samples are synthesized using SMOTE. ADASYN (Adaptive Synthetic Sampling) [41] uses the local distribution of neighboring samples to determine automatically how many synthetic samples need to be generated for each minority-class sample.
SMOTETomek (SMOTE with tomek links cleaning) [42] combines over and under sampling using SMOTE and Tomek links.
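The interpolation at the core of SMOTE and its variants can be sketched as below. This is a simplified illustration, not a full SMOTE implementation: it uses only the single nearest neighbor, whereas SMOTE chooses at random among the k nearest minority-class neighbors, and the three minority samples are hypothetical.

```python
import random

def nearest_neighbor(x, others):
    """Return the point in `others` closest to x (squared Euclidean distance)."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(x, o)))

def smote_like_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between x and its neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
# Hypothetical minority-class samples with two features each.
minority = [[1.0, 2.0], [1.5, 2.5], [4.0, 0.5]]
base = minority[0]
neighbor = nearest_neighbor(base, minority[1:])
synthetic = smote_like_sample(base, neighbor, rng)
```

Because the synthetic point always lies between two existing minority samples, the method adds new samples without simply duplicating existing ones, which is the main difference from random over-sampling.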
Three-fold cross-validation was used to evaluate the performance of the models on the training data set. The training data consist of 67% of the total data; the test data consist of the remaining 33%. Among all 427 sets of patient data, 285 (258 mild, 27 severe) were used as training data, and 142 (129 mild, 13 severe) were used as test data. Finally, we trained eight different machine learning algorithms on the training data and used the trained models to predict the test data. Accuracy, F1-score, AUC, and AUCPRC were used to analyze the models’ performance. Figure 2 shows the workflow of our method.
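A class-preserving (stratified) three-way split can be sketched as follows. Whether the study stratified its folds is not stated, so this is an assumption; the function name and the 0/1 label coding are also hypothetical.

```python
import random

def stratified_folds(labels, n_folds=3, seed=0):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)  # deal indices out round-robin
    return folds

labels = [0] * 387 + [1] * 40  # assumed coding: 0 = mild, 1 = severe
folds = stratified_folds(labels)
# (mild count, severe count) for each fold
counts = [(sum(labels[i] == 0 for i in f), sum(labels[i] == 1 for i in f))
          for f in folds]
```

With 387 mild and 40 severe cases, each fold receives 129 mild cases and 13 or 14 severe cases, so a single held-out fold can have the same composition as the reported test set (129 mild, 13 severe).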
Overall training flowchart of the COVID-19 mild-severe disease prediction model.
To better measure the model’s effectiveness, we used different evaluation indicators to compare the multiple aspects. According to the confusion matrix, the following indicators were used to evaluate the performance of the model comprehensively:
TP, FP, TN, and FN stand for True Positive, False Positive, True Negative, and False Negative, respectively.
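The confusion-matrix indicators can be computed as in the sketch below, which uses the standard textbook definitions of accuracy, precision, recall, and F1-score; it is an assumption that the study's formulas match these. The counts in the example are hypothetical and are not results from the study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute standard indicators from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for a 142-sample test set (not the study's results).
m = confusion_metrics(tp=10, fp=5, tn=124, fn=3)
```

In this example the accuracy is high even though a third of the positive predictions are wrong, which illustrates why accuracy alone can be misleading when the negative class dominates.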
Feature selection
To further verify the effect of different features on the machine learning prediction results, we performed 100 rounds of RandomForest feature ranking on the clinical features, the CT radiomics features, and the mixture of clinical and CT radiomics features, respectively. The mean of the 100 experimental results was used as the final feature importance ranking. Based on this ranking, we then determined the number of features that gave each machine learning algorithm its optimal results. The optimal number of features for each machine learning model is shown in Fig. 3.
Number of features for each machine learning algorithm to obtain optimal prediction results.
Experimental results of AUCPRC obtained by various machine learning algorithms using different types of features.
Features were selected in descending order of their random forest importance scores. The feature importance scores of the clinical features and the CT-extracted features are shown in Table 3. To obtain the best prediction results, we used GridSearchCV to optimize the parameters of each machine learning model. The optimal model parameters are shown in Table 4.
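An exhaustive grid search of this kind can be sketched without any library. The parameter grid and the stand-in scoring function are hypothetical; in practice the score would be a cross-validated metric of the fitted model rather than a closed-form expression.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid; the values are placeholders, not the study's settings.
grid = {"n_estimators": [50, 100, 200], "max_depth": [2, 3, 5]}

def toy_score(p):
    """Stand-in for a cross-validated score; peaks at 100 trees and depth 3."""
    return -abs(p["n_estimators"] - 100) - 10 * abs(p["max_depth"] - 3)

best, score = grid_search(grid, toy_score)
```

The number of evaluations grows multiplicatively with the number of values per parameter (nine combinations here), which is the main cost of an exhaustive search.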
Feature importance scores
The optimal model parameters
After parameter optimization and selection of the optimal features, we trained the COVID-19 mild-severe prediction models with eight machine learning algorithms. Because the experimental data are imbalanced, we chose the AUCPRC value as the main measure of model quality. Ten separate 3-fold cross-validation experiments were conducted, and the mean values were taken as the final results. Figure 4 shows the AUCPRC obtained by each machine learning algorithm using different types of features. For every algorithm, the results obtained with the combined clinical and CT radiomics features are higher than those obtained with clinical or CT radiomics features alone, which indicates that classification models built on both feature types are more accurate. Therefore, to further verify the effect of data imbalance on the experimental results, we selected the combined clinical and CT radiomics feature data for the sampling experiments that follow.
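One common estimator of the area under the precision-recall curve is average precision, sketched below. The study does not state which estimator it used, so this choice is an assumption, and the labels and scores are hypothetical. Tied scores are ranked in input order here, which a production implementation would handle more carefully.

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision values at each positive's rank."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_pos if n_pos else 0.0

# Hypothetical predictions: 2 positives among 10 samples.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
ap = average_precision(labels, scores)
baseline = sum(labels) / len(labels)  # expected value for random scores
```

Unlike ROC AUC, whose chance level is 0.5, the chance level of the precision-recall area is roughly the prevalence of the positive class. For this data set that is 40/427, or about 0.094, which is the reference point for reading the reported AUCPRC values.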
In the data sampling experiments, 3-fold cross-validation was used to divide the data, and the different sampling methods were then applied to the divided data. The under-sampling methods were used to sample 0.2, 0.4, 0.6, 0.8, and 1.0 times the majority-class data. The over-sampling methods were used to sample 0.1, 0.2, 0.4, 0.6, and 0.8 times the majority-class data for the minority class. All sampling methods were applied to the training set only; the test set was left unprocessed. Ten separate 3-fold cross-validation experiments were conducted, and the average of the ten results was used as the final result. Figure 5 shows the results without data sampling, and Figure 6 shows the optimal results after data sampling. A comparison of Figs 5 and 6 shows that the model results improve after sampling the data. RandomForest has the largest AUCPRC improvement, 3.7%, after using the sampling method, and GBDT has the best AUCPRC among the models used, 0.697, after the sampling process.
The optimal prediction results were obtained by each machine learning model trained using unsampled data.
The optimal prediction results were obtained by each machine learning model trained using sampled processed data.
The optimal prediction results of each machine learning model are obtained by training with data processed in different data sampling methods
To compare the results of each machine learning algorithm under different sampling methods and sampling ratios, we collated all post-sampling results in Table 5. The table shows that the results differ across combinations of model, sampling method, and sampling ratio. The results of all eight machine learning algorithms improved when the models were trained on sampled data. RandomForest obtained the largest AUCPRC improvement, 2.7%, after sampling with SVMSMOTE. GBDT obtained the largest AUCPRC value, 0.697, among the eight machine learning prediction models after sampling with SVMSMOTE. GBDT combined with the RandOver sampling method gave the best overall performance among the eight algorithms, with an accuracy of 0.931, an AUC of 0.942, and an AUCPRC of 0.694. These results may also serve as a reference for imbalanced-data experiments that combine different algorithms and sampling methods.
The purpose of this study was to develop a diagnostic model for predicting the severity of patients with COVID-19. Using clinical features and CT radiomics features, the optimal prediction accuracy of 0.932 and AUC value of 0.942 were obtained for the diagnostic model based on the GBDT algorithm after data sampling processing and feature selection. The model can assist clinicians in screening patients with severe COVID-19, providing more medical resources for these patients, and can also be used to improve patient prognosis decisions and assess prognostic treatment outcomes.
Several studies have been conducted to build diagnostic models for the severity of COVID-19 patients using machine learning algorithms, and details of the diagnostic models are shown in Table 6. These studies used patients’ clinical features to build diagnostic models and did not consider the impact of CT radiomics features on model prediction performance. In our study, to further validate the effect of CT radiomics features on the prediction model’s performance, we segmented the ROI on the chest CT by using SP-V-Net and extracted the CT radiomics features on the ROI. Experiments were performed separately using clinical features, CT radiomics features, and a mixture of both. Figure 4 shows that the COVID-19 mild-severe prediction model built by using a mixture of features of both has better performance.
Comparison of machine learning-based methods for COVID-19 mild and severe diagnostic studies
Data imbalance is present in many real-world decision problems, and in medical diagnosis models it can have a negative impact on experimental results. The studies of Liang et al. [14] and Zhu et al. [15] did not address data imbalance. In our research, to reduce the impact of data imbalance on the prediction results, we sampled the data using three different types of data sampling methods. A comparison of Figs 5 and 6 shows that the prediction results of the different models improved further after the data imbalance treatment. In this study, the optimal prediction results were obtained using ensemble models such as GBDT, AdaBoost, and XGB combined with the sampled data. These classifiers all use ensemble learning techniques to improve on the accuracy of individual classifiers and raise overall classifier performance. Related studies have also demonstrated that model prediction performance can be further improved by using ensemble models when dealing with data imbalance problems [43].
By further ranking and screening the mixture of CT radiomics features and clinical features, we found that PO2, age, PCT, LDH, and CRP were the five most important clinical risk factors for severe disease in patients with COVID-19, a result consistent with previous related studies. Studies have shown that older age and elevated PCT, LDH, and CRP are all important correlates of the severity of COVID-19 [17]. More importantly, image features extracted from chest CT remain important for distinguishing mild from severe disease in patients with COVID-19. The combination of clinical features and chest CT extracted features performs well in diagnosing the severity of COVID-19. COVID-19 patients have specific chest CT image features, including ground-glass opacities (GGO), multifocal patchy consolidation, and interstitial changes with a peripheral distribution [44]. Increases in lesion volume, ground-glass volume, and other volumes also make it possible for the model to predict the severity of COVID-19.
Machine learning algorithms are now widely used in auxiliary medical diagnosis and are playing an increasingly important role. Catic et al. [45] built prenatal diagnosis classification models using artificial neural networks (ANNs) to help physicians in their daily work, obtaining an average accuracy of 89.6% with feedforward neural networks and 98.8% with feedback neural networks. Begic et al. [46] applied machine learning algorithms to diagnose congenital heart defects, obtaining a diagnostic accuracy of 94.28% with a model built using SVM. Stokes et al. [47] applied a trained machine learning model to the diagnosis and referral of bronchitis and pneumonia and obtained an AUC of 93% using decision trees. In our study, the COVID-19 mild-severe diagnostic model built using the GBDT algorithm obtained optimal predictive performance with an accuracy of 93.2% and an AUC of 94.2%. Like the models in the above three studies, the model we developed performs well in its diagnostic task, here distinguishing mild from severe COVID-19. It can assist physicians in the early detection of severe COVID-19 patients and help provide them with better medical resources, which gives this study clinical significance. ML can help improve the reliability, performance, and accuracy of disease-specific diagnostic systems. Research on and application of ML in the medical field are also increasing, and the related research will provide more convenience for doctors and patients.
The present study still has some limitations that need to be considered. First, the number of patients with COVID-19 is relatively small, limiting the accuracy of the predictive model. Second, the diversity of data in our study is limited, all subjects are Chinese COVID-19 patients, and the results may not be fully applicable to data from other countries. Third, the number of severe patients’ data is small, and the mild and severe patients’ data are imbalanced. We need to collect more COVID-19 patient data, especially severe patients. Further research is still required.
This study shows that the COVID-19 mild-severe prediction model based on features extracted from chest CT and clinical characteristics can effectively differentiate the severity of COVID-19 patients and can provide helpful insights for the early diagnosis of mild and severe COVID-19. Moreover, the prediction models based on both chest CT features and clinical features had higher prediction performance than those built using either type of data alone. The results could help clinicians more effectively assess the severity of COVID-19 patients and stratify patients for treatment to reduce potential mortality and ease the burden of care.
Footnotes
Acknowledgments
This research was funded by Henan Science and Technology Development Plan 2022 (grant number 222102210219) and Zhengzhou University of Light Industry (grant number 2019ZCKJ228).
Ethics statement
The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board (or Ethics Committee) of the Shanghai Public Health Clinical Center (YJ-2020-S035-01, approval on 22 February 2020).
