Sage Journals: Discover world-class research

Abstract

Objectives: As COVID-19 continues to wreak havoc around the world, effective screening methods for safeguarding public health and bringing the pandemic under control are urgently required. Accordingly, this study proposes a robust pipeline for training and evaluating deep learning predictive models for the detection of COVID-19 from chest X-rays. Methods: The pipeline incorporates multiple techniques to combat overfitting and optimize the prediction results, including data augmentation, Bayesian optimization, the selection of appropriate performance metrics, and the use of early stopping during model training. The proposed pipeline is applied to three common deep learning models: ResNet50, NASNet-A-Mobile, and Xception. Results: The experimental results obtained using the COVID-XRay-5K v3 dataset with 2084 training and 3100 test examples show that the three models achieve AU-PRC (area under precision-recall curve) scores of 0.977, 0.963 and 0.900, respectively; and AU-ROC (area under receiver operating characteristic) scores of 0.994, 0.996 and 0.981. Moreover, at a 98.00% recall (sensitivity), the models achieve high specificities of 97.53%, 97.60% and 86.00%, respectively. Conclusion: Overall, the results suggest that all three models provide a promising approach for the rapid and reliable detection of COVID-19.

Keywords

Chest X-rays, COVID-19 detection, deep learning, image classification, screening test

Introduction

Coronavirus Disease 2019 (COVID-19) has caused untold suffering and hardship around the world over the past 2 years. It has led to millions of deaths and left hundreds of millions of people with permanent medical complications. Moreover, it continues to have a massive economic impact and has been the cause of much social disruption and unrest. Millions of enterprises are on the brink of collapse, and nearly 1.76 billion of the global workforce are struggling to make ends meet.¹ Moreover, worldwide poverty and undernourishment levels are at unprecedented highs.²

Never before have so many of the world’s researchers focused so urgently on a single topic.³,⁴ Together, they have made extraordinary progress in tackling COVID-19, most notably through the development of numerous effective vaccines and drugs. Among the various techniques currently available for COVID-19 diagnosis, rapid antigen/antibody tests, immunoenzymatic serological tests, and RT-PCR-based molecular tests are the most commonly used. However, radiological investigations, such as chest X-rays (CXRs) and computed tomography (CT), have also proven to be effective in diagnosing COVID-19 pneumonia.⁵

The deep learning field has undergone remarkable development in recent years and is now routinely deployed in many applications and services. Convolutional neural networks (CNNs), in particular, have achieved great success in the classification and segmentation of all kinds of medical images, including magnetic resonance imaging (MRI),⁶ microscopy,⁷ CT,⁸ ultrasound,⁹ X-ray,¹⁰ and mammography.¹¹ Many methods based on deep transfer learning have been proposed for the detection of lung diseases from CXRs. For example, Hashmi et al.¹² proposed a weighted classifier method for pneumonia detection that achieved an accuracy of 98.43% and an area under the receiver operating characteristic curve (AU-ROC) score of 99.76 when applied to a pneumonia dataset of 5836 images from the Guangzhou Women and Children’s Medical Center. Pham¹³ used an ImageNet model to detect COVID-19 in CXRs and showed that the performance obtained using a transfer learning approach was at least as good as that obtained by training the model from scratch. Ibrahim et al.¹⁴ used a pretrained AlexNet model to distinguish COVID-19 from non-COVID thoracic pathologies and healthy examples in CXRs obtained from different sources, achieving accuracies above 91% on the different classification tasks. Jain et al.¹⁵ evaluated three deep learning models (i.e., InceptionV3, Xception and ResNeXt) on 6342 CXR scans collected from a Kaggle repository, with all three models achieving accuracies above 93%. Zhang et al.¹⁶ achieved a sensitivity of 71.70% and an AUC of 83.61% on the X-COVID dataset using a confidence-aware anomaly detection model trained with the X-VIRAL dataset. Celik¹⁷ achieved a 99.84% accuracy in multiclass classification using CovidDWNet + GB trained on different datasets containing CT images. Moreover, Duong et al.¹⁸ investigated the detection performance of ResNet50 and achieved an accuracy and precision of around 91.93% and 92.3%, respectively, together with a recall of 0.952 and an F1-score of 0.938 on CXRs from different sources.

The present study proposes a robust pipeline for the training and evaluation of deep learning models to detect COVID-19 in posterior-anterior CXRs. Each stage of the proposed pipeline is individually investigated and optimized. The feasibility of the pipeline is demonstrated by performing COVID-19 binary classification on the COVID-XRay-5K v3 dataset provided by Minaee et al.¹⁹ Three pre-trained models (ResNet50, NASNet-A-Mobile and Xception) are repurposed for this task with the help of transfer learning. The proposed pipeline incorporates several key strategies for mitigating the effects of dataset imbalance, including (1) using the area under precision-recall curve (AU-PRC) as the primary training metric, (2) performing hyperparameter tuning with k-fold cross-validation and Bayesian optimization, (3) adopting early stopping during model training, and (4) performing threshold moving. The performance of the trained models is evaluated on a control group and compared with that of several other models reported in the literature for the same COVID-XRay-5K v3 dataset. For example, Naviwala et al.²⁰ achieved an AU-ROC score of up to 0.995 while Sadeghi et al.²¹ obtained a maximum accuracy of 98.32% and a 98.82% recall at 95.26% specificity. In addition, Kumar et al.¹⁰ achieved an accuracy of 98.25% and an 86% recall at 99.9% specificity.

Materials and methodology

Figure 1 shows the training and evaluation pipeline proposed in the present study. As shown, the pipeline consists of four main stages, where each stage serves as a foundation for the next. In the first stage, data splitting is used to prepare a well-organized, stratified-split dataset, and an augmentation technique is employed to ensure that the trained models are robust to overfitting. A hyperparameter tuning process, involving k-fold cross-validation and Bayesian optimization, is then applied to tune the hyperparameters of each of the selected models in such a way as to optimize the training performance. In the third stage, the models are trained using these optimized hyperparameters, together with early stopping and threshold moving methods to obtain well-trained models without overfitting. Finally, the performance of the trained models is evaluated on a control group and compared with that of previous models reported in the literature.

Figure 1.

Experimental pipeline for training and evaluating the proposed models.

Data preparation

The present study used the COVID-XRay-5K v3 dataset¹⁹ for training and evaluation purposes. Figure 2 presents some typical CXR images from the dataset. The 5000 COVID-19 negative images were uniformly sampled from the CheXpert dataset.²² A further 184 COVID-19 positive images were extracted; these images showed clear evidence of COVID-19 and were labeled accordingly by a board-certified radiologist. A detailed description of the dataset is given in Table 1. The images were divided into a training set consisting of 2000 negative images and 84 positive images and a testing set containing 3000 negative images and 100 positive images. The dataset is thus imbalanced, with an overall negative-to-positive sample ratio of approximately 27:1. The training set was further split into two subsets: a smaller training set and a validation set. As also shown in Table 1, the split was performed in a stratified manner (preserving the 2000:84 negative-to-positive ratio as closely as possible), with a training:validation ratio of 80%:20%.

Figure 2.

Representative CXR images extracted from COVID-XRay-5K v3 dataset.

Table 1.

Class distribution in raw datasets and post-splitting datasets.

	Dataset	Negative image	Positive image
COVID-XRay-5K	Training	2000	84
COVID-XRay-5K	Testing	3000	100
Post-splitting datasets	Training	1600	67
	Validation	400	17
	Testing	3000	100
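The stratified 80%:20% split described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `stratified_split`, the seed, and the use of per-class rounding are assumptions made for illustration. With the class counts of Table 1 it reproduces the post-splitting sizes (1600/67 for training and 400/17 for validation).

```python
import random

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split sample indices into training and validation sets, class by class."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    train_idx, val_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_val = round(len(idx) * val_fraction)  # per-class validation count
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return train_idx, val_idx

# 2000 negative (0) and 84 positive (1) training images, as in Table 1
labels = [0] * 2000 + [1] * 84
train_idx, val_idx = stratified_split(labels)

train_pos = sum(labels[i] for i in train_idx)
val_pos = sum(labels[i] for i in val_idx)
train_neg = len(train_idx) - train_pos
val_neg = len(val_idx) - val_pos
print(train_neg, train_pos, val_neg, val_pos)
```

Splitting each class separately keeps the rare positive class represented in both subsets, which a purely random split of 84 positives among 2084 images would not guarantee.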

In the first stage of the proposed pipeline (see Figure 1), data augmentation was performed to avoid overfitting and improve the generalization of the models to new data. As shown in Figure 3, all of the images were initially resized to 256 × 256 pixels. Thereafter, the training images were processed by 224 × 224 random cropping, random rotation in range [−10°, +10°] and random horizontal flips to mimic the variations naturally occurring in practical CXR scenarios. By contrast, the images in the validation set and test set were processed only by 224 × 224 center cropping to ensure identical augmented outputs between different runs, and accordingly, consistent validation results. A normalization process was applied to each image based on the custom mean and standard deviation values derived from the original dataset,¹⁹ that is, $\mathrm{mean} = [0.6629, 0.6629, 0.6629]$ and $\mathrm{std} = [0.0675, 0.0674, 0.0673]$.

Figure 3.

Data augmentation process.
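The crop geometry and normalization described above can be sketched in plain Python. This is an illustrative sketch only: the helper names are hypothetical, pixel intensities are assumed to be scaled to [0, 1] before normalization, and only the first-channel mean and standard deviation are used.

```python
import random

MEAN = 0.6629  # first-channel mean reported for the dataset
STD = 0.0675   # first-channel standard deviation reported for the dataset

def center_crop_box(size=256, crop=224):
    """Deterministic crop window (left, top, right, bottom) for validation/test."""
    offset = (size - crop) // 2
    return offset, offset, offset + crop, offset + crop

def random_crop_box(rng, size=256, crop=224):
    """Random crop window for training; the window always stays inside the image."""
    left = rng.randint(0, size - crop)
    top = rng.randint(0, size - crop)
    return left, top, left + crop, top + crop

def normalize(pixel, mean=MEAN, std=STD):
    """Standardize one pixel intensity (assumed to lie in [0, 1])."""
    return (pixel - mean) / std

rng = random.Random(0)  # hypothetical seed
eval_box = center_crop_box()
train_box = random_crop_box(rng)
print(eval_box, train_box, normalize(0.6629))
```

The center crop is the same on every run, which is why the validation and test results stay comparable across runs, whereas the random crop changes from epoch to epoch.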

The models selected in the present study (two moderate-sized models ResNet50²³ and Xception,²⁴ less than 30M parameters each, and a small-sized model NASNet-A-Mobile,²⁵ with less than 6M parameters) were deliberately chosen based on their competitive evaluation scores within the relative size range. All of these models have been pre-trained and benchmarked on ImageNet.²⁶

Hyperparameters

The optimal set of training hyperparameters for each model was determined through Bayesian optimization²⁷ combined with k-fold cross-validation. It is noted that these steps require five additional non-tunable hyperparameters, namely the number of epochs, the number of folds, the number of trials, the configurations of the Bayesian utility function (kappa and xi), and the pbounds. The cross-entropy function and area under the precision-recall curve (AU-PRC) were chosen as the loss function and primary evaluation metric, respectively, for both hyperparameter tuning and training.

Performance evaluation metrics

The area under the precision-recall curve (AU-PRC) was selected as the primary metric to be optimized because it remains informative on imbalanced datasets and is independent of the decision threshold.²⁸,²⁹ The AU-PRC was used as the primary training metric for selecting the best weights of each model during training, after which different decision thresholds were tried to further improve the performance. Moreover, the hyperparameters of each model were optimized using Bayesian optimization. Consequently, no class weighting or resampling technique was applied in this study. The original training set was divided into training and validation sets for the training process, while the original testing set was kept unchanged (see Table 1).
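To make the threshold independence of the AU-PRC concrete, the following minimal sketch estimates it with the step-wise average-precision sum. The estimator actually used in the study is not stated, so this choice is an assumption; the function name and the toy scores are hypothetical, and tied scores are not given special treatment.

```python
def average_precision(y_true, y_score):
    """Step-wise estimate of the area under the precision-recall curve."""
    # Rank samples from the highest to the lowest predicted score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += tp / rank  # precision at the rank where recall increases
    return ap / n_pos

# Toy example: two positives among five samples
y_true = [1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.3, 0.1]
ap = average_precision(y_true, y_score)
print(round(ap, 4))
```

Only the ranking of the scores enters the computation, so no decision threshold has to be chosen before the metric is evaluated.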

The confusion matrix is a table summarizing the number of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) for a particular model when comparing the groundtruth labels and the corresponding predictions in the sample space.

Five metrics can be derived from the confusion matrix, namely:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \times 100\% \tag{1}$$

$$\text{Specificity} = \frac{TN}{FP + TN} \times 100\% \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \times 100\% \tag{3}$$

$$\text{Precision} = \frac{TP}{TP + FP} \times 100\% \tag{4}$$

$$\text{F1-Score} = \frac{2TP}{2TP + FP + FN} \times 100\% \tag{5}$$

It is worth noting that when the decision threshold is shifted, all four numbers of TP, FP, TN, FN might vary. Thus, the metrics in equations (1), (2), (3), (4) and (5) are subject to change in a threshold-moving problem.
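Equations (1) to (5) translate directly into code. The sketch below uses a hypothetical helper name and, as a worked example, the ResNet50 test-set counts reported in Table 6 (TP = 94, FP = 6, FN = 6, TN = 2994).

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return the five metrics of equations (1)-(5), all in percent."""
    return {
        "accuracy": 100 * (tp + tn) / (tp + fp + fn + tn),
        "specificity": 100 * tn / (fp + tn),
        "recall": 100 * tp / (tp + fn),
        "precision": 100 * tp / (tp + fp),
        "f1": 100 * 2 * tp / (2 * tp + fp + fn),
    }

# ResNet50 counts on the test set, taken from Table 6
metrics = confusion_metrics(tp=94, fp=6, fn=6, tn=2994)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

Because every count depends on the decision threshold, calling the function with counts obtained at a different threshold yields a different set of metric values.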

Experimental setup

Hyperparameter tuning

The hyperparameter tuning process was implemented as a loop of two black-box functions connected end-to-end, namely model evaluation with k-fold cross-validation and global optimization with Bayesian optimization (see Figure 4). The loop was repeated iteratively until the desired score was obtained, at which point the current values of the model hyperparameters were harvested as the tuning output.

Figure 4.

Schematic illustration of the hyperparameter tuning process. Two black-box functions, namely model evaluation and global optimization, were employed to obtain the optimized hyperparameters.

Table 2 shows the lower and upper bounds (pbounds), and corresponding scales, of each of the tunable hyperparameters. The aim of the hyperparameter tuning process was to find the global optimum of the AU-PRC; accordingly, the configuration parameters of the Bayesian utility function were set as kappa = 10 and xi = 0.1.

Table 2.

Tunable hyperparameters and their bounds.

Tunable hyperparameter	Lower bound	Upper bound	Scale
Batch size	2³	2⁶	Logarithmic
Max learning rate (LR)	10⁻⁴	10⁻²	Logarithmic
Base LR/max LR	10⁻²	10⁰	Logarithmic
Max momentum	0.85	0.95	Linear
Base momentum	0.75	0.85	Linear
Weight decay	10⁻³	10⁻¹	Logarithmic
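The role of the scale column in Table 2 can be illustrated with a small sketch that maps a unit-interval coordinate onto each hyperparameter range. This is an assumption about how such bounds are typically handled, not the authors' code: the dictionary keys, the `decode` helper, and the rounding of the batch size to a power of two are all hypothetical.

```python
import math

# (lower bound, upper bound, scale), following Table 2
PBOUNDS = {
    "batch_size": (2**3, 2**6, "log"),
    "max_lr": (1e-4, 1e-2, "log"),
    "lr_ratio": (1e-2, 1e0, "log"),
    "max_momentum": (0.85, 0.95, "linear"),
    "base_momentum": (0.75, 0.85, "linear"),
    "weight_decay": (1e-3, 1e-1, "log"),
}

def decode(name, u):
    """Map u in [0, 1] to a hyperparameter value on the scale given in Table 2."""
    lo, hi, scale = PBOUNDS[name]
    if scale == "log":
        # Interpolate between the logarithms of the bounds.
        value = math.exp(math.log(lo) + u * (math.log(hi) - math.log(lo)))
    else:
        value = lo + u * (hi - lo)
    if name == "batch_size":
        # Assumption: batch sizes are restricted to powers of two.
        value = 2 ** round(math.log2(value))
    return value

print(decode("max_lr", 0.5), decode("batch_size", 1 / 3), decode("max_momentum", 0.5))
```

On a logarithmic scale the midpoint of the search interval is the geometric mean of the bounds, so the midpoint of the max LR range is 10⁻³ rather than roughly 5 × 10⁻³.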

After performing Bayesian optimization for 50 runs (trials), the target scores and equivalent sets of hyperparameters were retrieved. In theory, the easiest approach for selecting the optimal set of hyperparameters is simply to choose the parameter set which returns the highest score. However, such an approach works optimally only if the objective function is deterministic. That is, stochasticity of the objective function may well misguide the choice of the best solution. In the present study, two phenomena were observed during the Bayesian optimization process (see Figure 5). Firstly, the AU-PRC scores of the early trials were significantly lower than those of the later trials, and secondly, the query points were distributed unevenly in the variable space, forming several clusters in some ranges and gaps in the others, making it difficult to pinpoint where the actual optimum lay.

Figure 5.

Example of defective trials when running Bayesian optimization to find the optimal weight decay value.

Two strategies were employed to investigate the origins of these phenomena further. In the first strategy, the defective trials were separated from the high-performance trials to determine which of the hyperparameters were responsible for weighing down the score in the early trials. In particular, using the target score, a separating limit was determined based on the box plot policy of categorizing outliers, with the use of quartile (Q) ranges. Any trials with a score lower than this limit were considered to be defective. The lower outlier limit (OL) was calculated as

$$OL_{low} = Q_{1} - 1.5 \times (Q_{3} - Q_{1}) \tag{6}$$
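Equation (6) can be checked against the quartiles reported in Table 3. The sketch below is illustrative; the function names and the short list of trial scores are hypothetical, and the quartiles are taken as given rather than recomputed.

```python
def lower_outlier_limit(q1, q3):
    """Box-plot lower fence: Q1 - 1.5 x IQR, as in equation (6)."""
    return q1 - 1.5 * (q3 - q1)

def split_trials(scores, limit):
    """Separate high-performance trials from defective trials."""
    high = [s for s in scores if s >= limit]
    defective = [s for s in scores if s < limit]
    return high, defective

# Quartiles (Q1, Q3) from Table 3
limits = {
    "ResNet50": lower_outlier_limit(0.9812, 1.0000),
    "Xception": lower_outlier_limit(0.9199, 1.0000),
    "NASNet-A-Mobile": lower_outlier_limit(0.9608, 1.0000),
}
print({name: round(value, 4) for name, value in limits.items()})

# Hypothetical trial scores, for illustration only
high, defective = split_trials([1.0, 0.99, 0.97, 0.80, 0.62], limits["ResNet50"])
```

The ResNet50 and NASNet-A-Mobile limits come out at 0.9530 and 0.9020, and the Xception limit at approximately 0.7997, in agreement with the lower outlier limits listed in Table 3.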

In the second strategy, Pearson’s product-moment correlation coefficient was computed between the target score and each hyperparameter. The sense (i.e., positive or negative) and strength (i.e., strong, moderate, weak or negligible) of each correlation were then classified with reference to multiple recommendations, such as cut-off points for linear correlation at 0.1–0.3–0.5, or 0.3–0.5–0.7–0.9.³⁰
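The correlation analysis can be sketched as follows. The 0.1–0.3–0.5 cut-offs are used here as one of the conventions mentioned above; which convention was finally applied to each hyperparameter is not stated, so the `strength` helper is an assumption, and both function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def strength(r, cuts=(0.1, 0.3, 0.5)):
    """Label |r| using the 0.1-0.3-0.5 cut-off points (assumed convention)."""
    magnitude = abs(r)
    if magnitude < cuts[0]:
        return "negligible"
    if magnitude < cuts[1]:
        return "weak"
    if magnitude < cuts[2]:
        return "moderate"
    return "strong"

# Coefficients reported for ResNet50 in Table 4
print(strength(-0.5115), strength(-0.2054), strength(-0.0832))
```

Under these cut-offs the max LR coefficient for ResNet50 is labeled strong and the max momentum coefficient weak, which matches the wording used for those two hyperparameters in the discussion of Table 4.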

Model training and evaluation

The hyperparameter values selected for each model were used to configure the corresponding training loop, where this loop included an SGD optimizer and a cyclic LR scheduler (see Figure 6). In the training process, the training set was used to gradually improve the predictive ability of the model, while the validation set was used to evaluate the degree of overfitting and to trigger early stopping.

Figure 6.

Schematic illustration of training loop.
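Two ingredients of the training loop, the cyclic schedule and early stopping, can be sketched without any deep learning framework. The triangular policy, the step size, and the patience value below are assumptions made for illustration and are not taken from the study; the default learning rates and momenta correspond to the chosen values listed in Table 4 (max LR 10⁻⁴, LR ratio 10⁻¹, momentum 0.8 to 0.9).

```python
import math

def cyclic_value(iteration, step_size, low, high):
    """Triangular cycle: low -> high over step_size iterations, then back to low."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return low + (high - low) * max(0.0, 1.0 - x)

def cyclic_lr(iteration, step_size, base_lr=1e-5, max_lr=1e-4):
    return cyclic_value(iteration, step_size, base_lr, max_lr)

def cyclic_momentum(iteration, step_size, base_m=0.8, max_m=0.9):
    # Momentum is cycled inversely to the learning rate.
    return max_m - (cyclic_value(iteration, step_size, base_m, max_m) - base_m)

class EarlyStopping:
    """Stop when the validation score has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, score):
        if score > self.best:
            self.best, self.best_epoch, self.bad_epochs = score, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation AU-PRC scores per epoch
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, score in enumerate([0.90, 0.95, 0.94, 0.93, 0.92, 0.96]):
    if stopper.step(epoch, score):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_epoch, stopper.best)
```

In this toy run the best score is recorded at epoch 1 and training stops at epoch 4, so the weights saved at the best epoch, not those of the last epoch, would be the ones kept.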

Once the best model had been identified based on its AU-PRC score, it underwent further tuning by adjusting the decision threshold. This threshold tuning was performed using the F1-score metric since it is threshold-dependent yet robust to dataset imbalance. Furthermore, it is based on the precision and recall metrics, and can thus be easily mapped and visualized onto the same space as the PR curve. For each type of model, five candidates were trained on the training set and then evaluated on the validation set. The best candidate was then loaded from its saved parameters and used to perform a single feedforward pass through all of the samples in the test set.
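Threshold moving with the F1-score can be sketched as a simple sweep over the predicted probabilities. The function names and the toy validation data are hypothetical, and a sample is assumed to be classified as positive when its probability is greater than or equal to the threshold.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1-score when samples with probability >= threshold are called positive."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_threshold(y_true, y_prob):
    """Try every distinct predicted probability as a candidate threshold."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))

# Toy validation set with three positives and four negatives
y_true = [0, 0, 0, 1, 0, 1, 1]
y_prob = [0.01, 0.02, 0.03, 0.04, 0.05, 0.30, 0.90]
threshold = best_f1_threshold(y_true, y_prob)
print(threshold, round(f1_at_threshold(y_true, y_prob, threshold), 3))
```

In this toy case the best threshold is 0.04, far below the default of 0.5, which illustrates why a model trained on an imbalanced dataset can benefit from a lowered decision threshold.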

Results and discussion

Hyperparameter tuning

In this study, all experiments were conducted on the Google Colaboratory (Colab) platform. Each Colab instance is operated from the browser and provides free access to a Linux virtual machine from Google, equipped with a 2.2 GHz Intel Xeon processor and an NVIDIA Tesla T4 GPU with 16 GB of memory, on which PyTorch and various supporting libraries for the training and evaluation process were run.

The ResNet50, Xception and NASNet-A-Mobile models underwent hyperparameter tuning for a total of 50 Bayesian optimization trials × 5-fold cross-validation × 10 epochs. The tuning results for each model are summarized in Table 3. In general, being the most lightweight of the three models, the NASNet-A-Mobile model returned the shortest tuning time. The ResNet50 and Xception models are heavier and have a comparable size. However, the training time of the ResNet50 model was significantly shorter than that of the Xception model. This speed advantage of ResNet50 is most likely the result of it having skip connections with easy-to-calculate gradients, whereas Xception involves extensive convolutions and thus has an extended convergence time. The overall AU-PRC scores in Table 3 indicate that ResNet50 found better solutions with less noise than the other models. More significantly, the second quartile of ResNet50 equaled the maximum score; that is, in at least half of the trials, all of the ResNet50 models achieved perfect predictions on the validation data in at least one epoch. Moreover, based on the quartiles, the lower outlier limit for each model was inferred to provide a reliable baseline to distinguish between the high-performance trials and the defective trials (see Figure 7). It is noted that the defective trials played a trivial role in identifying the optimal set of hyperparameters, but were useful in highlighting the parameter values responsible for the low scores. The high-performance and defective trials were separated in order to distinguish the effects of the different hyperparameters on the target score. It was found that a large LR (equivalent to max LR in the experiment) was strongly associated with a poorer model performance due to its effect of increasing the stochasticity during the training process, which then caused the model to miss the optima. 
The batch size exhibited a similar effect, since a larger batch size tended to result in a lower score. Thus, lowering the LR and decreasing the batch size are instrumental in improving the model training results, particularly in the case of fine-tuning.³¹ In the present experiments, the maximum target value was obtained using a batch size of 16. In addition, the weight decay and LR ratio enhanced the results at intermediate values around the geometric means of the boundaries, and degraded them elsewhere. However, while a weight decay of 10⁻² is commonly used in the ML community, the LR ratio has no particular default. The remaining momentum hyperparameters showed no significant correlation with the target score.

Table 3.

Hyperparameter tuning results.

		ResNet50	Xception	NASNet-A-Mobile
Total tuning time		29h 19m 02s	37h 13m 43s	28h 49m 34s
Overall AU-PRC		0.9618 ± 0.0982	0.8855 ± 0.2409	0.9270 ± 0.1540
Quartiles	Q3	1.0000	1.0000	1.0000
	Q2	1.0000	0.9919	0.9973
	Q1	0.9812	0.9199	0.9608
Lower outlier limit		0.9530	0.7997	0.9020

Figure 7.

Hyperparameter tuning results for NASNet-A-Mobile model.

Table 4 shows the Pearson correlation coefficients of the various hyperparameters with respect to the AU-PRC score. The ratio of the base LR to the max LR has only a weak negative correlation with the AU-PRC score for all three models. Besides, as the weight decay shows no significant correlation, it was assigned the conventional value of $10^{-2}$ in every case. For the ResNet50 model, the max momentum parameter shows a weak negative correlation with the AU-PRC score, and was assigned a value of 0.85. For the other models, the momentum parameters were simply assigned the mean values of the respective bounds, which were also the default values (0.9 for max and 0.8 for base momentum) in the cyclic LR strategy.

Table 4.

Pearson correlation coefficients of each hyperparameter versus AU-PRC score and chosen values.

Hyperparameter	ResNet50	Xception	NASNet-A-Mobile	Chosen value
Batch size	−0.2361	−0.2466	−0.1184	16
Max LR	−0.5115	−0.4409	−0.6540	$10^{-4}$
Base LR/max LR	−0.0832	−0.2119	−0.2205	$10^{-1}$
Max momentum	−0.2054	−0.0161	0.0281	0.85 for ResNet50; 0.90 for others
Base momentum	−0.0081	−0.0657	−0.0365	0.8
Weight decay	−0.0827	0.0713	−0.0512	$10^{-2}$

Model training and evaluation

The chosen hyperparameter values were used to train the five candidates for each type of model. For illustration purposes, Figure 8 shows the training and validation results obtained for one of the NASNet-A-Mobile candidates. In this case, the predictive ability of the model peaked at epoch 5 and the training process was stopped after 41 epochs.

Figure 8.

Model training performance of NASNet-A-Mobile candidate.

Table 5 shows the training performance results for the three models when applied to the validation set. A close inspection shows that all of the models achieved a peak AU-PRC score after around 60% of the training time before being stopped early. In addition, the results obtained for the training time and number of training epochs indicate that ResNet50 was the fastest model to converge in terms of both the total training time and the number of iterations. The ResNet50 model also showed the most consistent learning progress over all the candidate models. Moreover, the evaluation metrics reveal that the ResNet50 candidates generally outperform the NASNet-A-Mobile and Xception candidates, with the exception of the AU-ROC metric, for which the Xception model achieves the highest score.

Table 5.

Training results obtained from candidate models on validation set.

Metric	ResNet50	Xception	NASNet-A-Mobile
Training time (s)	1453 ± 129.9	4703 ± 1280	2390 ± 985.3
Training epochs	23.2 ± 1.789	55.0 ± 15.36	39.4 ± 16.71
Highest AU-PRC at epoch	0.985 ± 0.016 at 14.8 ± 3.962	0.953 ± 0.021 at 29.4 ± 12.93	0.974 ± 0.015 at 22.2 ± 23.16
AU-ROC	0.993 ± 0.013	0.997 ± 0.002	0.987 ± 0.015
Highest F1-score at threshold	0.976 ± 0.013 at 0.020 ± 0.013	0.897 ± 0.018 at 0.045 ± 0.024	0.959 ± 0.032 at 0.079 ± 0.074
Accuracy	99.81% ± 0.11%	99.14% ± 0.13%	99.66% ± 0.27%
Precision	98.89% ± 2.48%	86.91% ± 2.14%	96.60% ± 5.01%
Recall	96.47% ± 3.22%	92.94% ± 4.92%	95.29% ± 2.63%
Specificity	99.95% ± 0.11%	99.40% ± 0.14%	99.85% ± 0.22%

Finally, the best candidate of each type of model (as determined by the highest AU-PRC score) was applied to the test set. Figures 9, 10, 11 and 12 present some typical classification results obtained for different CXR images in the dataset. Note that for convenience, the positive and negative COVID labels and predictions are denoted as (+) and (−), respectively, while the ResNet50, NASNet-A-Mobile and Xception models are denoted simply as R, N and X respectively. In addition, false predictions are marked in red. It is noted that decision thresholds equivalent to 98% recall (to be listed in Table 5) were used in every case.

Figure 9.

True positives classification results obtained for different CXR images in the dataset (5 random examples).

Figure 10.

False Negatives classification results obtained for different CXR images in the dataset (all four examples).

Figure 11.

True Negatives classification results obtained for different CXR images in the dataset (5 random examples).

Figure 12.

False Positives classification results obtained for different CXR images in the dataset (5 random examples).

In general, the models successfully predicted the presence or absence of COVID-19. As expected, the predicted percentages were high for TP cases and low for TN cases. For every FN instance, at least one model predicted the correct label. However, some particular image patterns may adversely affect the prediction performance, such as severe disorientation and misaligned borders. For example, the Test0080 image in Figure 10 shows slight signs of opacity or consolidation, while the Test0137 image in Figure 12 shows patterns akin to ground-glass opacities. These patterns may be too intense to be offset by data augmentation, thereby degrading the ability of the trained models to recognize the patterns essential for COVID detection. The models sometimes failed in cases where the images bore strong similarities to the opposite class. Thus, the full set of performance metrics was utilized to properly evaluate the model performance. In addition, a statistical test³² was carried out to investigate whether the error rates of each pair of models differ significantly. The corresponding results are presented in Table 6. The p-value $p_{R,X} < 0.05$ indicates that the error rates of the ResNet50 and Xception models are significantly different. Likewise, NASNet-A-Mobile is significantly different from Xception. Conversely, the p-value $p_{N,R} > 0.05$ indicates that the error rates of ResNet50 and NASNet-A-Mobile are not significantly different. The results in Table 6 also show that ResNet50 and NASNet-A-Mobile differ only slightly in performance, but both models outperform Xception on all metrics. The overall results indicate that ResNet50 is the best-performing model, followed by NASNet-A-Mobile, and then Xception (see the bold values in Table 6). In other words, the test results are consistent with those obtained in the earlier hyperparameter tuning and model training stages. Figure 13 shows the ROC and PR curves of the three models, which are consistent with Table 6.

Table 6.

Evaluation results obtained by trained models on test set.

Metric	ResNet50	Xception	NASNet-A-Mobile
AU-PRC	0.977	0.900	0.963
AU-ROC	0.994	0.981	0.996
Decision threshold	0.022	0.080	0.176
Confusion matrix	$\left[\begin{array}{cc} 94\ (TP) & 6\ (FN) \\ 6\ (FP) & 2994\ (TN) \end{array}\right]$	$\left[\begin{array}{cc} 89 & 11 \\ 27 & 2973 \end{array}\right]$	$\left[\begin{array}{cc} 90 & 10 \\ 7 & 2993 \end{array}\right]$
F1-score	0.940	0.824	0.914
Accuracy	99.61%	98.77%	99.45%
Precision	94.00%	76.72%	92.78%
Recall	94.00%	89.00%	90.00%
Specificity	99.80%	99.10%	99.77%
p-value	$p_{R, X} = 0.0002$	$p_{X, N} = 0.004$	$p_{N, R} = 0.352$

Figure 13.

ROC and PR curves of three models.

While the classification accuracy was not the chief focus of the present study, it is noted that the present models all achieve an excellent performance. Moreover, while the difference in accuracy between the present models and the models in other studies was marginal (≤1.5%), the difference in AU-PRC was notably larger (≈7%). Once again, this implies that AU-PRC is a suitable metric for imbalanced datasets. Since it is essential to achieve a high recall for COVID-19 detection, the specificity of each model was evaluated at 98% recall, as shown in Table 7. It is seen that the present ResNet50 and NASNet-A-Mobile models outperform most of the models reported in the literature. In general, the results presented in this study suggest that all three models provide a useful tool for COVID-19 infection screening. For example, ResNet50 (the best performing model) can be integrated into a desktop application as an alternative screening method to real-time PCR and other rapid test methods. Moreover, NASNet-A-Mobile (the most lightweight model) can be developed into a mobile application in order to support rapid preliminary COVID-19 diagnosis. Table 7 compares the performance of the present models with that of the models reported in the literature for the COVID-Xray-5K v3 dataset. As shown, the proposed models have higher AU-ROC and AU-PRC values than those obtained in Refs. 10, 19, 20 and 21. The accuracy of the proposed models (98.77%) is comparable with that of Ref. 10 (99.45%) and higher than those obtained in Refs. 19, 20 and 21. Furthermore, the specificity values of the proposed models are higher than those obtained in Refs. 19 and 21 and comparable with those obtained in Refs. 10 and 21.

Table 7.

Comparison of performance results for present models and those reported in the literature on the COVID-Xray-5K v3 dataset.

Models	Accuracy	Sensitivity	Specificity	AU-ROC	AU-PRC	Ref
SVM, VGG16, VGG19 (CNN architectures and classifiers of Ref. 10)	99.45%	0.94	99.90% at 86.00% recall	-	-	¹⁰
ResNet50, ResNet18, SqueezeNet, DenseNet-121	98.00%	0.98	92.90%	0.992	0.897	¹⁹
VGG-16, AlexNet, ResNet18, Inception V3	-	0.94	99.70%	0.995	-	²⁰
Deep CNN	94.00%	0.95	92.22%	-	-	²¹
NASNet-A-Mobile, Xception, ResNet50	98.77%	-	97.60% at 98.00% recall	0.996	0.977	Proposed study
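Reporting specificity at a fixed recall, as in Table 7, amounts to choosing the highest threshold that still captures the required fraction of positive samples. The following sketch shows one way to do this; the function name and the toy scores are hypothetical, and predictions are assumed to be positive when the probability is greater than or equal to the threshold.

```python
import math

def specificity_at_recall(y_true, y_prob, target_recall=0.98):
    """Return (threshold, specificity) at the highest threshold reaching the target recall."""
    pos = sorted((p for y, p in zip(y_true, y_prob) if y == 1), reverse=True)
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Number of positives that must be detected (small epsilon guards float error).
    k = math.ceil(target_recall * len(pos) - 1e-9)
    threshold = pos[k - 1]  # lowest score among the k best-ranked positives
    tn = sum(1 for p in neg if p < threshold)
    return threshold, tn / len(neg)

# Toy example: five positives, four negatives, target recall of 80%
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.1, 0.65, 0.3, 0.2, 0.05]
threshold, specificity = specificity_at_recall(y_true, y_prob, target_recall=0.8)
print(threshold, specificity)
```

Fixing the recall first and reading off the specificity puts all models on the same operating point, which makes their false-positive behavior directly comparable.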

Despite these advantages, several limitations of the present study need to be acknowledged. Firstly, COVID-19 datasets from different medical institutions contain images of different quality, which may cause the performance of the proposed models to vary. Ibrahim et al.¹⁴ and Duong et al.¹⁸ showed that the accuracies of their models differed considerably when their classifiers were evaluated on CXR datasets from other medical institutions. Secondly, the COVID-Xray-5K v3 dataset is imbalanced, with an overall negative-to-positive sample ratio of approximately 27:1. To mitigate this limitation, the present study sought appropriate evaluation metrics for selecting and storing the best weights of the models. Finally, the present study focused on only three models (i.e., ResNet50, Xception and NASNet-A-Mobile) and a limited set of methods for finding the optimal hyperparameter values. In future work, more classification models will be employed for comparison and optimization purposes.

Conclusion

This study has presented a robust pipeline for the development of well-trained models for the detection of COVID-19 from CXRs. The pipeline takes specific account of two of the most common problems affecting the deep learning processing of medical images, namely dataset imbalance and threshold moving. The feasibility of the proposed pipeline has been demonstrated using three models (ResNet50, Xception and NASNet-A-Mobile). The performance of the three models has been evaluated and analyzed on the COVID-Xray-5K v3 dataset. The results have shown that the ResNet50, NASNet-A-Mobile and Xception models achieve AU-PRC scores of 0.977, 0.963 and 0.900, respectively, and AU-ROC scores of 0.994, 0.996 and 0.981. Furthermore, at a 98.00% recall, the three models achieve high specificities of 97.53%, 97.60% and 86.00%, respectively. Notably, the predictive performance of the three models is at least as good as that of five other models reported in the literature.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Vietnam National University Ho Chi Minh City (VNUHCM) under Grant No. DS2023-28-02.

Code availability

The source code employed in this study is publicly available in a GitHub repository []. The repository contains all Python scripts needed to train and evaluate the proposed models.

Ethical statement

This study is based on the COVID-XRay-5K v3 dataset, which has been published online for the academic community; consequently, there is no cause for ethical concern regarding this study.

ORCID iD

Thi-Thu-Hien Pham

Data availability statement

Data underlying the results presented in this paper are publicly available in the GitHub repository [] provided by Minaee et al.¹⁹

References

1. Chriscaden. Impact of COVID-19 on people’s livelihoods, their health and our food systems. WHO, 2020. https://www.who.int/news/item/13-10-2020-impact-of-covid-19-on-people's-livelihoods-their-health-and-our-food-systems (Accessed 04 September 2022).

2. Yao, Fan, et al. Compound impact of COVID-19, economy and climate on the spatial distribution of global agriculture and food security. Sci Total Environ 2023; 880: 163105. DOI: 10.1016/j.scitotenv.2023.163105.

3. Maher, Noorden. How the COVID pandemic is changing global science collaborations. Nature 2021. https://www.nature.com/articles/d41586-021-01570-2 (Accessed 04 September 2022).

4. Wang, Zhang. The COVID-19 pandemic and supply chain: international cooperation patterns and influence mechanism. Benchmark Int J 2024; 31(2): 466–486. DOI: 10.1108/BIJ-04-2022-0257.

5. Liu, Xing, Zhao, et al. A new classification method for diagnosing COVID-19 pneumonia based on joint CNN features of chest X-ray images and parallel pyramid MLP-mixer module. Neural Comput Appl 2023; 35(23): 1–13. DOI: 10.1007/s00521-023-08604-y.

6. Lundervold. An overview of deep learning in medical imaging focusing on MRI. Z Med Phys 2019; 29(2): 102–127. DOI: 10.1016/j.zemedi.2018.11.002.

7. Ruberto, Loddo, Puglisi. Blob detection and deep learning for leukemic blood image analysis. Appl Sci 2020; 10(3): 1176. DOI: 10.3390/app10031176.

8. Huang, et al. Deep supervised learning using self-adaptive auxiliary loss for COVID-19 diagnosis from imbalanced CT images. Neurocomputing 2021; 458: 232–245. DOI: 10.1016/j.neucom.2021.06.012.

9. van Sloun RJG, Cohen, Eldar. Deep learning in ultrasound imaging. Proc IEEE 2020; 108(1): 11–29. DOI: 10.1109/JPROC.2019.2932116.

10. Kumar. Classification of COVID-19 X-ray images using transfer learning with visual geometrical groups and novel sequential convolutional neural networks. MethodsX 2023; 11: 102295. DOI: 10.1016/j.mex.2023.102295.

11. Shen, Margolies, Rothstein, et al. Deep learning to improve breast cancer detection on screening mammography. Sci Rep 2019; 9(1): 12495. DOI: 10.1038/s41598-019-48995-4.

12. Hashmi, Katiyar, Keskar, et al. Efficient pneumonia detection in chest Xray images using deep transfer learning. Diagnostics 2020; 10(6): 417. DOI: 10.3390/diagnostics10060417.

13. Pham. Classification of COVID-19 chest X-rays with deep learning: new models or fine tuning? Health Inf Sci Syst 2020; 9: 2. DOI: 10.1007/s13755-020-00135-3.

14. Ibrahim, Ozsoz, Serte, et al. Pneumonia classification using deep learning from chest X-Ray images during COVID-19. Cognit Comput 2021: 1–13. DOI: 10.1007/s12559-020-09787-5.

15. Jain, Gupta, Taneja, et al. Deep learning based detection and analysis of COVID-19 on chest X-ray images. Appl Intell 2021; 51: 1690–1700. DOI: 10.1007/s10489-020-01902-1.

16. Zhang, Xie, Pang, et al. Viral pneumonia screening on chest X-rays using confidence-aware anomaly detection. IEEE Trans Med Imag 2021; 40(3): 879–890. DOI: 10.1109/TMI.2020.3040950.

17. Celik. Detection of Covid-19 and other pneumonia cases from CT and X-ray chest images using deep learning based on feature reuse residual block and depthwise dilated convolutions neural network. Appl Soft Comput 2023; 133: 109906. DOI: 10.1016/j.asoc.2022.109906.

18. Duong, Nguyen, Iovino, et al. Automatic detection of Covid-19 from chest X-ray and lung computed tomography images using deep neural networks and transfer learning. Appl Soft Comput 2023; 132: 109851. DOI: 10.1016/j.asoc.2022.109851.

19. Minaee, Kafieh, Sonka, et al. Deep-COVID: predicting COVID-19 from chest X-ray images using deep transfer learning. Med Image Anal 2020; 65: 101794. DOI: 10.1016/j.media.2020.101794.

20. Naviwala, Qureshi. Performance analysis of deep learning frameworks for COVID-19 detection. In: 2021 International Conference on Digital Futures and Transformative Technologies (ICoDT2), 2021, pp. 1–6. DOI: 10.1109/ICoDT252288.2021.9441537.

21. Sadeghi, Rostami, M-K, et al. A deep learning approach for detecting covid-19 using the chest X-ray images. Comput Mater Continua (CMC) 2023; 74(1): 751–768. DOI: 10.32604/cmc.2023.031519.

22. Irvin, Rajpurkar, et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proc. AAAI'19/IAAI'19/EAAI 2019; 33: 590–597. DOI: 10.1609/aaai.v33i01.3301590.

23. Zhang, Ren, et al. Deep residual learning for image recognition. In: 2016 IEEE CVPR, 2016, pp. 770–778. DOI: 10.1109/CVPR.2016.90.

24. Chollet. Xception: deep learning with depthwise separable convolutions. In: 2017 IEEE CVPR, 2017, pp. 1800–1807. DOI: 10.1109/CVPR.2017.195.

25. Zoph, Vasudevan, Shlens, et al. Learning transferable architectures for scalable image recognition. In: 2018 IEEE CVPR, 2018, pp. 8697–8710. DOI: 10.1109/CVPR.2018.00907.

26. Cadene. Pretrained models for pytorch. 2018. https://github.com/Cadene/pretrained-models.pytorch (Accessed 16 August 2021).

27. Mockus. Bayesian approach to global optimization: theory and applications. Mathematics and its Applications 1989; 37. DOI: 10.1007/978-94-009-0909-0.

28. Saito, Rehmsmeier. The Precision-Recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 2015; 10(3): e0118432. DOI: 10.1371/journal.pone.0118432.

29. Davis, Goadrich. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd international conference on Machine learning, 2006, pp. 233–240. DOI: 10.1145/1143844.1143874.

30. Mukaka. Statistics corner: a guide to appropriate use of correlation coefficient in medical research. Malawi Med J 2012; 24(3): 69–71.

31. Kandel, Castelli. The effect of batch size on the generalizability of the convolutional neural networks on a histopathology dataset. ICT Express 2020; 6(4): 312–315. DOI: 10.1016/j.icte.2020.04.010.

32. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 1998; 10(7): 1895–1923. DOI: 10.1162/089976698300017197.
